MEASUREMENT


AND EVALUATION 
IN TEACHING 


Norman E. Gronlund



Professor of Educational Psychology 
University of Illinois 


THE MACMILLAN COMPANY, NEW YORK 
COLLIER-MACMILLAN LIMITED, LONDON 


© Copyright, NORMAN E. GRONLUND, 1965

All rights reserved. No part of this book may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying,
recording or by any information storage and retrieval system, without
permission in writing from the Publisher.

First Printing
Library of Congress catalog card number: 65-12726 
THE MACMILLAN COMPANY, NEW YORK 
COLLIER-MACMILLAN CANADA, LTD., 
TORONTO, ONTARIO 


Printed in the United States of America 


preface 




This book introduces the teacher and prospective teacher to the principles and 
procedures of evaluation which are essential to good teaching. The main theme 
which runs throughout the book is that evaluation is an integral part of the 
teaching-learning process and that it involves three fundamental steps: (1) identi- 
fying and defining instructional objectives in behavioral terms, (2) constructing 
or selecting evaluation instruments which most effectively appraise these specific 
learning outcomes, and (3) using the results to improve learning. 

The book was designed primarily for elementary and secondary teachers, and 
the examples and illustrations reflect this orientation. The focus on principles, 
however, should make it useful to teachers at all levels and in all areas of in- 
struction. Administrators, supervisors, and counselors should also find the 
material helpful in their work. 

Special efforts were made to keep the book interesting and understandable 
without slighting basic concepts, like validity and reliability, or sacrificing 
technical accuracy. The writing style is direct, and practical examples are used 
to clarify difficult points. Liberal use is made of sample test items and excerpts 
from evaluation devices to illustrate principles and techniques. 

In keeping with the focus of the book, statistical procedures receive minor 
attention in the main body of the text. They are introduced only when essential 
for understanding the discussion and then the emphasis is on interpretation rather 
than computation. For those who want to acquire a minimum level of computa- 
tional skill in statistics, a special section is provided in the appendices. 

Lengthy descriptions of standardized tests have been excluded from the book. 
In the section on standardized testing, stress is placed on how to locate, select, 
and use published tests wisely. Some of the leading tests in each area of measure- 
ment are referred to in the text and briefly described in the appendices, but these 
are intended only as orientation for the beginner. With the thousands of tests
now available, it is more important to become familiar with such selection guides
as Buros’ Tests in Print and Mental Measurements Yearbooks than with the 
characteristics of any particular list of tests. 

The special emphases in the book are probably best indicated by listing the 
learning outcomes which should result from its use. These expected outcomes are: 


1. An understanding of the interrelated nature of teaching, learning, and 
evaluation. 

2. The ability to clearly define objectives in behavioral terms.

3. An understanding of the concepts of validity and reliability and their 
role in constructing, selecting, and using evaluation instruments. 

4. The ability to construct classroom tests which measure specific learning 
outcomes (from simple to complex). 

5. The ability to select standardized tests which are most appropriate for a 
particular situation. 

6. The ability to administer, score, and use test results (with due regard to 
the necessary precautions). 


7. The ability to interpret test scores (with full awareness of the ever-present
error of measurement).

8. The ability to construct, select, and use nontest evaluation instruments.

9. An appreciation of both the potentialities and the limitations of the
various tests and evaluation procedures used in teaching.

10. An understanding of how evaluation procedures can contribute to the
teaching-learning process.


It is not intended that these outcomes be attained by reading alone. Many
of them refer to skills, which can be developed only through practical experience.
I have found it useful to have students develop a test construction project for
some instructional unit, and to select and review a few standardized tests in
their own teaching area. Those actively engaged in teaching are also encouraged
to apply the principles and procedures in their classrooms as rapidly and
frequently as is feasible.


My approach to measurement and evaluation has been shaped largely by
the work of Ralph W. Tyler and by the writings of L. J. Cronbach, P. L. . . .
list, R. L. Thorndike, and R. M. W. . . . everything written in this book,
they . . .


NORMAN E. GRONLUND 


SPECIAL ACKNOWLEDGMENT 


Jean C. Angley provided very special 
assistance during the writing of this 
book. She contributed case material and 
other written material to the first two 
chapters and made valuable suggestions 
for improving the style and clarity of 
writing throughout the entire manuscript. 
Her contributions have helped to make 
the book more interesting and lucid. 


NORMAN E. GRONLUND 


PART I.
The Evaluation Process

1. The Role of Evaluation in Teaching
2. Defining Objectives for Evaluation Purposes
3. Relating Evaluation Procedures to Objectives
4. The Validity of Evaluation Results
5. Other Characteristics Desired in Evaluation Procedures

PART II.
Constructing Classroom Tests

6. Principles and Procedures of Classroom Testing
7. Constructing Objective Test Items: Simple Forms
8. Constructing Objective Test Items: Multiple-Choice Form
9. Measuring Complex Achievement: The Interpretive Exercise
10. Measuring Complex Achievement: The Essay Test
11. Preparing, Administering, and Appraising Classroom Tests

PART III.
Using Standardized Tests

12. Standardized Tests of Achievement and Scholastic Aptitude
13. Selecting and Using Standardized Tests
14. Interpreting Test Scores and Norms

PART IV.
Evaluating Procedures, Products, and Typical Behavior

15. Evaluating Learning and Development: Observational Techniques
16. Evaluating Learning and Development: Peer Appraisal and Self-Report

PART V.
Using Evaluation Results in Teaching

17. Improving Learning, Marking, and Reporting

APPENDICES

Appendix A  Elementary Statistics
Appendix B  Table of Squares and Square Roots
Appendix C  A List of Test Publishers
Appendix D  A Selected List of Standardized Tests

Subject Index
Author Index


PART I
the evaluation process


Chapter 1
the role of evaluation in teaching

Evaluation techniques are indispensable tools to the teacher. . . . How- 
ever, evaluation is not merely a collection of techniques—evaluation is a 
process—it is a continuous process which underlies all good teaching and 
learning. 


Some form of evaluation is inevitable in teaching. It is as inevitable in class- 
room teaching as it is in all fields of activity when judgments need to be made, 
however simple or complex the consideration involved—whether it is a matter 
of deciding what to wear today or what profession to follow through the ensuing 
years. Unfortunately, evaluation in the classroom is all too often done as though
it were extraneous to the main purpose of teaching. 

When crossing a little-used country lane, we normally decide when it is safe 
to cross with little awareness of having made a judgment. On the other hand, 
a test pilot is quite aware of making a judgment when he decides to fly a new 
airplane. His life depends upon meticulous inspection of all tabulations of 
laboratory tests, reports from the weather station, medical reports about his 
own physical condition and capabilities and, finally, upon his judgment based 
on past experience and every bit of reliable information which it has been 
possible to gather about test-flying in general and about test-flying this new 
plane in particular. A test pilot who failed to make such a carefully considered 
judgment and relied instead on his “feeling” of how the plane looked or the 
motors sounded, would obviously be accepting a greater risk than most of us 
would care to incur. 

The possible aftermath of haphazard, or wholly subjective, evaluation in a
teaching-learning situation is not likely to be quite so dire as in the case of a
foolhardy test pilot. But since the evaluations teachers make can have a
tremendous influence on the lives of their students, they should not be lightly
made, and certainly never casually made. The role of evaluation is so intrinsic
to the teaching-learning situation that even hasty consideration seems to
indicate the advantages of a systematic use of planned evaluation procedures.

Teachers have at their disposal a great variety of sources and methods for
gathering information about their pupils. Each source or method may be used
singly for a particular purpose, but the combined use of several is far more
common. Let us take the example of a teacher . . . a class into two or three
groups for instruction . . .

It had become obvious to Miss . . . span of abilities in dealing with . . .
(hidden, he thought) during the . . . too easy, while Billy shredded cr . . .
apples and oranges and pies and . . . Sevens made any sense at all! . . . and
recollection of the two boys' performances on written work, Miss . . . knew
that Max and Billy . . . different groups. This way of making judgments . . .
cross the country lane: it does not require sy . . .

. . . it was apparent to Miss Evans that . . . of her pupils’ abilities and
achieve . . . sections would be in Max’s group and . . .

Miss Evans was encouraged; she had made a beginning . . . the test scores
again . . . scores on quantitative and arithmetic reasoning, . . . Since all the
“doubtful” . . . ed test scores, provided her with the . . .

. . .’s group would spend . . . discovering . . . concrete objects; the . . .
“new” arithmetic facts . . . mastery in “old” facts; the eighteen others would
continue . . . -grade material until she could construct some . . . decided the
best way to do that was to consult the “review” sections of the first-grade
arithmetic program and use those as sources from which to draw items for
diagnostic tests. With those temporary plans made, she had finished her work
for the day.

Within the next two weeks, however, she administered and scored four ten- 
minute diagnostic tests she had devised to cover the essentials of the first-grade 
arithmetic program. The scores revealed that the assignments to the three groups 
had been, by and large, sound ones. Of the larger group, a few needed more 
experience in dealing with money, three needed some review for telling time, 
about half required extensive review of the relationship between cardinal and 
ordinal numbers (enough to make review for all beneficial), but the most worth- 
while discovery she made concerned those who had scored high in reasoning 
but low in computation. All of them had encountered their difficulties in making 
the transfer from the concrete to the abstract. Since success at this juncture is 
fundamental, she wisely planned a series of lessons for the entire group to 
proceed repeatedly from concrete through semiconcrete to abstract and “back- 
ward”—from abstract through semiconcrete to concrete. 

Miss Evans’ over-all objective from the beginning, of course, was to help 
children learn second-grade arithmetic as determined by the school curriculum 
and the text. As we have observed, however, she discovered that not all of her 
pupils were ready for learning second-grade arithmetic. By combining several 
evaluative procedures—daily observation in the classroom, classroom written 
work, standardized test scores, and diagnostic tests—she had found a sound 
basis for developing learning experiences leading toward the over-all objective. 
She may not have been aware of it at the time, but she had also laid the founda- 
tion for evaluating future achievement. With the accumulation of daily papers, 
scores on tests provided in the text, and scores of achievement tests she con- 
structed, she would be ready to make her own evaluative judgments concerning 
pupil progress and she would be prepared to show parents objective evidence 
to support her judgments. Parents could see for themselves just where their chil- 
dren had started and what progress they had made. 

The purpose, then, of a book on measurement and evaluation is to help teach- 
ers make better evaluative judgments. You should be warned at the outset, 
however, that you will find no magic solutions to your educational problems in 
test scores or the results of checklists or rating scales. There is no more magic 
in these than there is an absolute guarantee in laboratory findings that a test 
pilot will survive. What you will find is exactly what Miss Evans found and 
what a pilot finds in his specialized tabulations and reports—more objective 
information on which to base decisions. 

As we guide pupils toward the achievement of classroom objectives, diagnose 
their learning difficulties, determine their readiness for new learning experiences, 
place them in classroom groups for special activities, assist them with their 
problems of adjustment, and prepare reports of pupil progress for parents, we 
cannot escape making evaluative judgments. Decisions must be made and ac- 
tion must be taken. The more accurately we judge our pupils, the more effective 
we will be in directing their learning. An understanding of the principles and
procedures of evaluation, then, should aid us in making more intelligent
decisions in directing pupil progress toward worthwhile educational goals.


THE MEANING OF EVALUATION 


As is common with terms which are part of our general vocabulary, there is
some confusion concerning the meaning of the term evaluation as it applies
especially to education. In some instances, it is used as a synonym for the term
measurement. Thus, a teacher who administers an achievement test might say
either that he is “measuring” achievement or that he is “evaluating”
achievement, with little regard for the specific meaning of the two terms. In
other cases, “evaluation” is used as a collective term for those appraisal
methods which do not depend on “measurement.” This use of the two terms
distinguishes “evaluations” as qualitative descriptions of pupil behavior
(i.e., checklists, rating scales, and so on) as opposed to “measurements,” which
are quantitative descriptions (i.e., test scores). When the meaning of the term
evaluation is analyzed, it is easy to understand how these misconceptions came
about.

From an educational viewpoint, evaluation may be defined as a systematic
process of determining the extent to which educational objectives are achieved
by pupils. There are two important aspects of this definition. First, note that
evaluation implies a systematic process, which omits casual uncontrolled
observation of pupils. Second, evaluation always . . . have been previously
identified. Wi . . . (goals), it is . . . judge the extent of progress.

This definition indicates . . . is a much more comprehensive and . . .
includes both qualitative and quantitative . . . value judgments concerning
the . . . behavior measured. The following diagrams clearly show the
relationship between measurement and evaluation:

    Evaluation = Quantitative description of pupils (measurement)
                 + Value judgments

    Evaluation = Qualitative description of pupils (nonmeasurement)
                 + Value judgments

. . . evaluation may or may not be based on measurement, . . . beyond the
simple quanti . . . evalua . . . y can a pupil do multiplication . . .
understanding of the number system? . . . with other pupils in small
groups . . . how much? Has he made any improvement in using his time
effectively? If so, how much? Is his handwriting more legible? If so, how much
more? These questions are typical of those which we must be prepared to ask
ourselves and to answer about each of our pupils. A variety of methods are
therefore necessary, and a sound evaluation program will include both
measurement and nonmeasurement techniques, each to be used as appropriate.

1 Goals, objectives, and desired outcomes are used interchangeably to identify
the results we expect from the educational process.


EVALUATION IN THE SCHOOL PROGRAM 
Evaluation and the Teaching-Learning Process 


Broadly conceived, the main purpose of classroom teaching is to change pupil 
behavior² in desired directions. When viewed in this light, evaluation becomes
an integral part of the teaching-learning process. The “desired directions” are 
the educational objectives established by the school and the teacher; evaluation 
is the process of determining the extent to which these objectives are being 
achieved. While the interdependent nature of teaching and learning is beyond 
dispute, the interdependent nature of teaching, learning, and evaluation is less 
often recognized. The author believes that the interdependence of these three 
facets of education, however, is clearly recognizable in the following steps 
included in the educational process: 

1. Identifying and Defining Objectives in Terms of Desired Changes 
in Pupil Behavior. The first step in both teaching and evaluation is that of 
determining the learning outcomes to be expected from classroom instruction. 
What should pupils be like at the end of the learning experience? In other words, 
what kind of learning product is being sought? What knowledges and under- 
standings should the pupils possess? What skills should they be able to display? 
What interests and attitudes should they have developed? What changes in 
habits of thinking, feeling, and doing should have taken place? In short, what 
specific behavior changes are we striving for, and what are pupils like when we 
have succeeded in bringing about those changes? 

Only by identifying objectives and stating them clearly in terms of specific 
behavior can we provide direction to the teaching process and set the stage for 
ready evaluation of learning outcomes. You may recall that it was only after 
Miss Evans recognized that all of her pupils must be able to make the association 
between concrete objects and abstract number symbols that she was able to 
make definitive judgments about her pupils’ progress. This step is so vital to 
the total role of evaluation that the next chapter is entirely devoted to the proc- 
ess of identifying and defining educational objectives. 

2. Planning and Directing Learning Experiences in Harmony with 
Stated Educational Objectives. This is the point at which course content 
and teaching methods are integrated into planned learning experiences so that 
pupil behavior will change, we hope, in the desired direction. Here the emphasis 
is on the process, rather than the product, of learning, as illustrated by Miss
Evans' three different teaching plans for her three different groups. Although
evaluation is only indirectly related to this step (in that later evaluations of
the results of learning can be used to appraise the effectiveness of the learning
experiences), what has been implied before should now be emphasized. Unless
the planning and directing of learning experiences is preceded by a clear
statement of objectives and followed by an evaluation of pupil progress, it will
be impossible to determine the effectiveness of the learning experiences which
we spend so many hours planning and preparing for.

2 Refers to all changes in the intellectual, emotional, and physical sphere.

3. Determining Pupil Progress . . . tives. While the first step . . .
answers the question w . . . evaluation. How do we . . . How do we choose
the . . . construct, or select, speci . . . score those techniques . . . for
instance, a rough . . . did it. Since these are . . . progress, the major
portion . . .

4. Using the Results . . . Pupil progress can, and should, contribute
directly to . . . s. Evaluation procedures help clarify . . . learn. They
provide him with concrete . . . help him recognize areas of learning . . .
pupil and the teacher, his readiness for . . . guides, again for the pupil as
well as the teacher, for selection of . . . learning experiences and pupil
placement in special groups. It behooves us, therefore, to choose and construct
evaluation procedures with great care.

Many of the indirect uses of results . . . commonly recognized; for example,
repo . . . indirect uses are less often realized. Inf . . .


Other Uses of Evaluation Results 


As the main purpose of teaching is to change pupil behavior, the main purpose 
of evaluation is to improve learning and instruction. All other uses are second- 
ary or supplementary to this major purpose. At this point, however, some of 
the more important supplementary uses will be discussed briefly. 

1. Reporting Pupil Progress to Parents. The systematic use of evalua- 
tion procedures in the classroom provides the teacher with an objective and 
comprehensive picture of each pupil’s progress. Whether this report is presented 
to parents in writing, or orally in teacher-parent conferences, the objectivity 
apparent in planned measurement and evaluation procedures enables the teacher 
to focus on the pupil’s actual school achievement instead of resorting to unsub- 
stantiated generalities. The comprehensive nature of the evaluation process also 
equips the teacher to report on the total development of the pupil rather than on 
a limited area of it. This kind of over-all objective information about pupils 
provides the foundation for the most effective cooperation between parents and 
teachers. 

2. Use in Guidance and Counseling. The results of evaluation procedures 
are especially useful for guidance and counseling. Assisting a pupil with edu- 
cational and vocational decisions, guiding him in the selection of curricular and 
extracurricular activities, and helping him solve personal and social adjustment 
problems, all require an objective knowledge of the pupil’s abilities, interests, 
attitudes, and other personal characteristics. The more comprehensive the pic- 
ture of the pupil’s strengths and limitations in various areas, the greater the 
likelihood of effective guidance and counseling. 

3. Use in School Administration. Just as planned appraisal methods help 
the teacher determine the effectiveness of course content and teaching methods, 
a comprehensive continuous evaluation program in the school aids the adminis- 
trator. From the collected data, he is able to judge the extent to which the 
objectives of the school are being achieved, to identify strengths and weaknesses 
in the curriculum, and to appraise special programs in the school. Evaluation 
also provides the information on which to base administrative decisions con- 
cerning the placement, grouping, and promotion of pupils. In the public rela- 
tions area, an evaluation program is indispensable for gathering the objective 
data to be used in interpreting to the community the goals and accomplishments
of the school.


TYPES OF EVALUATION PROCEDURES


One of the distinctive features of the evaluation process is the use of a wide 
variety of procedures. As indicated earlier, some of these may be classified as 
quantitative techniques because the results can be reduced to numerical scores. 
Other means of appraisal are classified as qualitative techniques because their 
results can be expressed only in verbal descriptions. In addition to this broad 
classification, there are two major ways of classifying evaluation procedures: 


. . . motivated to put forth his best effort. In short, the evaluation . . .
an individual can do. Aptitude and achievement tests . . . category. These two
types of tests are commonly distinguished . . . of the results rather than by
the qualities of the tests . . . test is primarily designed to predict success
in some . . . while an achievement test is designed to indicate . . . learning
activity. Since some tests may be used for both purposes, however, it is obvious
that the difference is mainly a matter of emphasis. For example, . . . to
measure achievement at the end of the course ma . . . ent provide useful
designations for discussions of measures of ability.

The second subdivision in this classification of procedures includes those
designed to reflect . . . numerous illustrations of . . . l area of personality
appraisal . . . attitudes, and various aspects . . . category. While this is
an . . . l behavior, evaluations of . . .


Classification by Evaluative Method 


Information for evaluation may be obtained by presenting an individual with 
a given set of tasks to perform, by asking him questions about himself, or by 
asking other persons to observe and judge his behavior. These three general 
methods of obtaining data designate the major categories used in classifying 
types of evaluation techniques. They are commonly referred to as (1) testing 
procedures, (2) self-report techniques, and (3) observational techniques. 

1. Testing Procedures. A test is merely a series of tasks which is used 
to measure a sample of a person’s behavior at a given time. The most common 
tests used in schools, of course, are achievement tests. They may be oral or 
written. Written tests can be further subdivided in a number of ways. One is to 
distinguish between teacher-made and specialist-made tests. Those constructed 
by the classroom teacher are generally called informal tests, while those designed 
by specialists and administered, scored, and interpreted under standard condi- 
tions are called (logically) standardized tests. 

A second common classification of written achievement tests is that of essay
versus objective testing. Although the differentiation here is obvious, there is
an important difference in scoring the two types. The essay test is subjectively
scored; that is, the opinion of the scorer influences the results. In contrast,
objective tests can be objectively scored, which simply means that equally
competent scorers obtain the same results. There is a misguided tendency to
assign somewhat esoteric virtues to objective tests on the single basis of
objectivity of scoring, as though thoughtful and careful . . . serves only . . .
consideration. This pitfall should be assiduously avoided. . . . other . . .
the term “objective” in the designation “objective test.”

Still another classification of achievement tests is by the function they are
to serve. On this basis they are labeled mastery tests, survey tests, or
diagnostic tests; all th . . . objective. A mastery test measures the
knowledges, skills, and other learning outcomes that all pupils must acquire.
Tests of basic skills, such as command of the number facts of arithmetic,
are . . . of this type. Mastery tests set a minimum standard which all students
are expected to achieve, and those who make errors on the tests are expected to
drill until they are able to make perfect or near-perfect scores. In survey
tests the emphasis is on general achievement. In contrast to mastery
tests, . . . is to measure the extent of difference in various pupils in . . .
the total test score and the varying degrees of success among pupils . . . the
focal point. Miss Evans, for instance, might have also totaled each child's
scores of the four . . . tests to get an indication of general achievement for
comparison with scores made on the standardized tests given . . . Since that
was not her primary purpose, however, the four . . . diagnostic tests, which are
constructed so that part scores . . . responses reveal specific disabilities and
deficiencies in achievement. . . . then, of a diagnostic test is of little or no
interest. In summary, the three types of achievement tests might be
distinguished as follows: mastery tests are concerned with minimum achievement,
survey tests with maximum achievement, and diagnostic tests with the
identification of specific disabilities.

Intelligence tests, necessarily standardized, are also commonly used in schools
and are also classified in various ways. Individual intelligence tests
generally . . . observation of the individual's reaction . . . person can be
tested at a time. Their . . . ning. Group intelligence tests require . . . ion;
consequently, many individuals . . . intelligence tests are further
classified . . . d. If the individual taking the test is . . . tus (often a part
of individual tests), . . . is also applied to certain achievement . . .
language is used in responding, it is designated a verbal test. If the response
is in terms of reaction to or manipulation of pictures or geometric figures, it
is classified as a nonverbal test. Nonverbal tests are commonly used with young
children, the mentally retarded, and others who are deficient in verbal skills.


In addition to th . . . y mentioned, various types of tests are classified as
being either primarily speed or primarily power tests. A speed test is composed
of items of approximately the same level of difficulty, scored according . . .
completed in a given time. The time limit is . . . ies of relatively simple
tasks; the purpose of . . . speed tests are most . . . ment in typing,
shorthand, and other areas of . . . ability.

2. Self-Report Techniques. Every individual has a wealth of information
about himself. Some of this info . . . evaluation purposes . . .
questionnaire, . . . this information . . . unstructured type. The flexibility
of the unstructured interview makes it possible for the interviewer to pursue
promising leads which arise spontaneously and for the interviewee to elaborate
upon his answers until he is certain that his feelings and attitudes are clearly
understood. This flexibility of response is seldom found in other evaluation
techniques.

The questionnaire method of obtaining information from individuals is most
commonly used in systematic attempts to evaluate interests, attitudes, and other
aspects of personal and social adjustments. It has an advantage over the
interview method in that it can be used in group situations. Each individual is
presented with a series of questions or statements to which he must respond
by answering “yes” or “no,” “agree” or “disagree,” or in some similar manner
indicate his feelings and opinions. Such self-report methods are also referred
to as inventories. Thus, “personality questionnaire” and “personality inventory”
are used interchangeably. Although these devices are often referred to as “tests”
(i.e., personality tests, interest tests, and so on), this is a misnomer. Since
evaluation by means of inventory or questionnaire depends on an individual's
description of himself, classifying such techniques as tests is very misleading
and should be avoided.

The reader has undoubtedly already identified the inherent problem encountered
in using questionnaire and inventory techniques. Since the usefulness of
both depends on the willingness of the individual to give honest answers, they
are decidedly limited for purposes of evaluating pupil progress. They are probably
most useful for guidance purposes, where the individual’s desire to understand
himself and to make wise future plans encourages him to describe himself
as he actually is rather than as he would like to be, or would like others to
believe he is.

3. Observational Techniques. Reliable information about an individual’s
typical, or usual, behavior is best obtained from persons who have observed him
in a variety of situations. In the case of school pupils, both the teacher and the
pupils’ peers have numerous opportunities to observe their typical behavior.
The various observational techniques are merely systematic methods of recording
these observations for purposes of evaluation either at the moment or at a
later date. Anecdotal records, checklists, rating scales, and sociometric techniques
are included in this category.

Anecdotal records provide the least structured method of recording behavioral
observations. In essence, an anecdotal record is merely a brief description of
some observed behavior which appeared significant for evaluation purposes.
It is a sort of verbal picture of an incident in the daily life of the pupil. Anecdotal
records are generally obtained periodically for each pupil [...]
describe those episodes most typical of the pupil’s behavior, always with reference
to the educational outcomes being evaluated. They [...],
therefore, for evaluating outcomes in the area of personal [...].

A checklist is a prepared list of statements relating to [...] performance
in some area, or a product of some performance. Each item in
the list is checked in some way to indicate presence or absence of a particular
quality. Checklists require “all” or “none” responses. They are frequently used
to evaluate aspects of pupils’ interests, attitudes, activities, skills, and personal
characteristics.


[...] closely related to sociometric procedures is the “guess who” technique.
[...] presented a number of positive and negative [...]
to name the individuals in the [...]
best fit each description. The extent to which a description is characteristic of
an individual is determined by the number of times he has been named [...]
for that particular behavior description. An analysis of such data
indicates how an individual [...]
evaluating personality and character traits.
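
The tallying step of the “guess who” technique can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the pupil names,
the two descriptions, and the function name tally_nominations are invented here
and do not come from the text. Each ballot records the names one pupil wrote
down for each description, and the count for a pupil is simply the number of
times he was named.

```python
from collections import Counter


def tally_nominations(ballots):
    """Count how often each pupil is named for each description.

    ballots: a list of dicts, one per responding pupil, mapping a
    behavior description to the list of names written down for it.
    """
    counts = {}
    for ballot in ballots:
        for description, names in ballot.items():
            counts.setdefault(description, Counter()).update(names)
    return counts


# Hypothetical ballots from three pupils (all names are invented).
ballots = [
    {"is always friendly": ["Ann", "Raj"], "often argues": ["Tom"]},
    {"is always friendly": ["Ann"], "often argues": ["Tom", "Raj"]},
    {"is always friendly": ["Ann", "Tom"], "often argues": []},
]

counts = tally_nominations(ballots)
for description, tally in counts.items():
    print(description, dict(tally))
```

Keeping a separate count for each description preserves the distinction between
positive and negative descriptions instead of collapsing them into one score.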


may be viewed: 




1. Determining and clarifying what is to be evaluated always has priority in 
the evaluation process. No evaluation device should be selected or developed 
until the purposes of evaluation have been carefully defined. In terms of evalu- 
ating pupil progress, this means that the identification and definition of educa- 
tional objectives, to which reference has already been made, is always the first 
order of business. As we shall see later in the text, the effectiveness of the evalu- 
ation process depends as much upon a careful description of what to evaluate 
as it does upon the technical qualities of the evaluation instruments used. 

One of the factors which has retarded development in the measurement of 
human behavior generally has been the concentration upon techniques rather 
than process. All too frequently, tests and other evaluation instruments have 
been developed and used without a clear notion of the characteristics being 
measured. Even the concept of intelligence has been only vaguely defined by 
those developing intelligence tests. Recent efforts to describe the components of 
the intellect more clearly³ and to define educational outcomes more precisely⁴
indicate a desirable trend toward greater concern with a careful definition of 
the aspects of human behavior to be measured. Future progress in the area of 
educational measurement and evaluation depends heavily upon our ability to 
define in precise terms those aspects of pupil behavior which are regarded as 
significant to the educative process.

2. Evaluation techniques should be selected in terms of the purposes to be 
served. When the aspect of pupil behavior to be evaluated has been precisely 
defined, the evaluation technique which is most appropriate for evaluating that 
aspect of pupil behavior should be selected for use. All too frequently evaluation 
techniques are selected on the basis of how accurately they measure, how ob- 
jective the results are, or how convenient they are to use. All of these criteria 
are important but secondary to the main criterion—is this evaluation technique 
the most effective method for determining what we want to know about the 
pupil? Each evaluation technique is appropriate for some purposes and inappropriate
for others. The appropriateness of the technique for the intended purpose
should be the prior consideration in its selection.

Fruitless discussions concerning the relative merits of various types of evaluation
procedures are avoided, if this principle is followed. A question which frequently
leads to much heated debate is: Should teachers use objective tests or
essay tests? It is obvious, in light of this principle, that both should be used.
Objective tests are most effective for measuring some educational objectives
and essay tests are most effective for others. Similarly with other evaluation
techniques. The question is not: Should this technique be used? but rather:
When should this technique be used?

³ J. P. Guilford, “The Structure of Intellect,” Psychological Bulletin, 53, 267-293, 1956.
⁴ Benjamin S. Bloom (ed.), Taxonomy of Educational Objectives (New York: Longmans,
Green, 1956).

3. Comprehensive evaluation requires a variety of evaluation techniques. No
single evaluation technique is adequate for appraising pupil progress toward all
of the important outcomes of instruction. In fact, most evaluation techniques


are rather limited in scope. An objective test of factual knowledge provides
important evidence concerning a pupil’s achievement, but the results tell us
little or nothing about how well he understands the material, the extent to which
he is developing thinking skills, how his attitudes are changing, how he would
perform in an actual situation requiring application of the knowledge, or what
influence the knowledge might have on his personal adjustment. Such appraisals
require evidence beyond that which can be obtained by an objective test. [...]
testing, self-report techniques, and various observational methods would all be
needed to evaluate such a diverse array of instructional outcomes.

One reason we have so many different types of evaluation procedures is that 
each provides unique but limited evidence on some aspect of pupil behavior. To
get a more complete picture of a pupil’s achievement, we need to combine the
results from a variety of techniques. If the techniques are selected in terms of
the specific purposes they can best serve, as suggested previously, then our
composite picture of the pupil will be as adequate as we can obtain with our
present evaluation instruments.

4. Proper use of evaluation techniques requires an awareness of their limitations
as well as of their strengths. Evaluation techniques vary from fairly well-developed
measuring instruments (e.g., scholastic aptitude tests) to rather crude
observational methods. Even our very best educational measuring instruments,
however, fall far short of the precision we would like them to have. All are
subject to one or more types of error.

First, there is sampling error. Since we can only measure a small sample of a
pupil’s behavior at any one time, there is always a question of the adequacy of
the sample. Is this spelling test of twenty words a good sample of the pupil’s
spelling ability? Is this test of social studies a representative sample of what the
pupils should know about social studies? Are these observations of a pupil’s
social behavior typical of his general social adjustment? Such questions make
clear the problem of obtaining an adequate sample and the possibility of sampling
error being present in our measurements of pupil behavior.
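
The spelling-test question can be made concrete with a small simulation. The
sketch below is illustrative only: it assumes a hypothetical pool of 1,000
spelling words, of which a pupil can spell 75 percent, and draws many different
twenty-word tests from that pool. None of these figures come from the text; they
are assumptions chosen to show how scores vary from one sample of words to
another even though the pupil’s spelling ability never changes.

```python
import random


def sample_scores(known_fraction, pool_size=1000, test_length=20,
                  trials=2000, seed=0):
    """Score one pupil on many randomly drawn tests of the same length.

    known_fraction: share of the word pool the pupil can spell (assumed).
    Returns a list with the number of words spelled correctly on each test.
    """
    rng = random.Random(seed)
    n_known = round(known_fraction * pool_size)
    # True marks a word the pupil can spell, False a word he cannot.
    pool = [True] * n_known + [False] * (pool_size - n_known)
    return [sum(rng.sample(pool, test_length)) for _ in range(trials)]


scores = sample_scores(0.75)
print("lowest score: ", min(scores))
print("highest score:", max(scores))
print("average score:", sum(scores) / len(scores))
```

Under these assumptions the average stays close to 15 words out of 20, but
individual tests scatter above and below that figure, which is the sampling
error the paragraph describes.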

A second source of error is found in the evaluation instrument itself or in 
the process of using the instrument. For example, scores on objective tests are 
influenced by chance factors such as guessing; scores on essay tests are modified 
by the subjective judgment of the person doing the scoring; the results of self-
report techniques are distorted by the individual’s desire to present himself in a
favorable light; and observations of behavior are subject to all of the biases
of human judgment. These and other errors inherent in the use of evaluation 
techniques must be recognized if the techniques are to be used wisely. 

A major source of error arises from improper interpretation of evaluation
results. Persons unwilling to recognize the limitations in measurement and
evaluation instruments attribute to them a precision they do not possess. It is
not uncommon for teachers to distinguish between two pupils on the basis of
one or two test-score points when such differences can be accounted for by
chance errors alone. At best, our evaluation instruments provide only approximate
results and should be interpreted accordingly.
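
How easily chance alone produces a difference of one or two points can be
sketched with a short simulation. The figures are assumptions made for
illustration, not taken from the text: two pupils each know the answers to the
same 30 items on a hypothetical 50-item, four-choice objective test and guess
blindly on the other 20. The function name observed_score is likewise invented.

```python
import random


def observed_score(known, guessed, n_choices, rng):
    """Items truly known plus items answered correctly by blind guessing."""
    lucky = sum(rng.random() < 1 / n_choices for _ in range(guessed))
    return known + lucky


rng = random.Random(1)
trials = 5000

# Two pupils with identical knowledge take the same test many times over.
diffs = [
    abs(observed_score(30, 20, 4, rng) - observed_score(30, 20, 4, rng))
    for _ in range(trials)
]

share = sum(d >= 2 for d in diffs) / trials
print(f"Share of trials with a gap of 2 or more points: {share:.0%}")
```

Because the two simulated pupils know exactly the same material, every gap
between their scores here is a product of guessing alone.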


[...] skeptical of evaluation procedures. A healthy awareness of the limitations of
[...] makes it possible to use them most effectively. Many of
the errors that commonly occur in the evaluation process can be eliminated by
[...] and selecting evaluation techniques. Others can be
reduced by [...] skill in use of the techniques. The remainder can be
[...] in the interpretation of results. Much of what is written in the
[...] is directed toward helping you minimize errors and use
evaluation techniques skillfully. As a guiding principle, it will be helpful to
keep in mind that the cruder the instrument, the greater the limitations and,
consequently, the more skill needed in its use.

5. Evaluation is a means to an end, not an end in itself. The use of evaluation
techniques implies that some useful purpose will be served and that the teacher
is aware of the purpose. To blindly gather data about pupils and then
file it away in the hope that it will some day prove useful is a
waste of both time and effort. If standardized tests are used, it is also a waste
of money. Unfortunately, in some schools there is still a tendency to administer
[...] batteries of standardized tests with little regard for the use to be made
of the results. Such practices are of no value to the school program and may
actually be harmful if the motives for testing are misinterpreted by pupils, teachers,
or parents.


Most of the misuses of tests and other evaluation techniques can be avoided
by viewing evaluation as a process of obtaining information upon which to base
educational decisions. This would imply that the types of decisions to be made
would be identified before the evaluation procedures were selected; that the
evaluation procedures would be selected in terms of the decisions to be made;
and that no evaluation procedure would be used unless it contributed to improved
decisions of an instructional, guidance, or administrative nature.


SUMMARY 


Educational evaluation plays an important role in the school. It is an integral
part of the instructional program and it provides information which serves as
a basis for a variety of educational decisions. The main emphasis in educational
evaluation, however, is the pupil and his learning progress.

Evaluation may be defined as a systematic process of determining the extent
to which educational objectives are achieved by pupils. The evaluation process
includes both measurement and nonmeasurement techniques for describing
changes in pupil behavior as well as value judgments concerning the desirability
of the behavioral changes.

The interrelated nature of teaching, learning, and evaluation can be seen in
the following sequential steps in the educational process: (1) identifying and
defining objectives, (2) planning and directing learning experiences, (3) determining
pupil progress, and (4) using the results to improve learning and
instruction. Although the main purpose of evaluation is to improve learning and
instruction, evaluative data are also useful in reporting to parents, in guidance
and counseling, and in school administration.

A vast array of evaluation procedures is available for school use. Some are 
designed to determine what a pupil can do (e.g., measures of aptitude and 
achievement) and others provide evidence on what the pupil will typically do 
(e.g., measures of interests, attitudes, and personality factors). Specific procedures
include various types of tests, self-report techniques, and observational
techniques. The tests can be classified by their technical features into the following
contrasting types:


Oral and written
Informal and standardized
Essay and objective
Mastery, survey, and diagnostic
Individual and group
Performance, verbal, and nonverbal
Speed and power


Self-report techniques include interview and questionnaire procedures. The 
major observational techniques are anecdotal records, checklists, rating scales, 
and sociometric procedures. 

Educational evaluation should be viewed as a process which is guided by a 
number of general principles. These principles emphasize the importance of 
the following: (1) identifying the purposes of evaluation, (2) selecting evalu- 
ation techniques in terms of these purposes, (3) using a variety of evaluation 
techniques, (4) being aware of the limitations of the evaluation techniques used,
and (5) regarding evaluation as a basis for improved instructional, guidance,
and administrative practices.



Chapter 2 
defining objectives 
for evaluation 
purposes 


What types of learning outcomes do you expect from your teaching:
knowledges, understandings, thinking skills, performance skills, attitudes?
. . . Clearly defining the desired learning outcomes is the first step
in good teaching; it is also essential in the evaluation of pupil progress.


We have already learned that one of the most important but most neglected 
aspects of classroom instruction is that of identifying and defining the learning 
outcomes to be expected. In other words, too little attention is paid to determining
precisely and specifically what we want pupils to acquire in the way
of factual information, mental and physical skills, attitudes, and personality
traits. As a result, one of two extreme situations usually exists. In the one case,
our objectives are limited to the learning of material covered in a textbook and
our teaching and evaluation procedures are primarily concerned with the retention
of textbook content. At the other extreme, overly ambitious goals are set for
a course: goals so general and so idealistic that their attainment is impossible
either to achieve or to evaluate. The reason these two situations are so common
is probably that the task of clearly defining educational objectives appears
Gargantuan and, therefore, overwhelming. It need not be, despite some admitted
complexities. Furthermore, rewards in terms of more effective teaching, learning,
and evaluation are great.

The purpose of this chapter is to help you learn how to reduce the task to
manageable proportions and thus avoid the equally undesirable extremes. It is
designed to help you learn how to develop statements of educational objectives
which go beyond a listing of course content and yet are realistic in terms of
pupil attainment. The logical first step in that direction is to analyze the nature
of the task. Just what kinds of things, precisely, are we talking about when we
talk about “educational objectives”?


EDUCATIONAL OBJECTIVES AS LEARNING OUTCOMES 


Educational objectives are goals toward which pupils progress. They are the 

end results of learning stated in terms of changes in pupil behavior. The term 
behavior, as used here, refers to mental and emotional, as well as physical, 
reactions. Thus an increase in knowledge, a broadening of understanding, an 
improvement in a physical skill, a shifting of attitude, and a deepening of appre- 
ciation are all classified as changes in behavior. 
When viewing educational objectives in terms of learning outcomes, it is
important to keep in mind that we are concerned with the products of learning 
rather than with the process of learning. The relation of educational objectives 
(product) to learning experiences (process) designed to develop desired changes 
in behavior is shown by the diagram below: 


             Learning Experience                Learning Outcomes
                 (Process)                          (Product)

Pupil ---->  Study of cell structure of  ---->  Knowledge of parts of cell
             plants in laboratory               Skill in using microscope
                                                Ability to write accurate reports of
                                                scientific observations


This diagram should help make clear a number of pertinent points regarding 
the role of educational objectives in teaching-learning situations. First, note that 
objectives establish direction, and that when they are stated in terms of learn- 
ing outcomes, they consist of more than a list of content. Note also the distinc- 
tion between “study of” and “knowledge of” cell structures. The content (study 
of cell structure) is more aptly listed under process because it is the vehicle 
through which objectives (knowledge of cell structure and so on) are attained. 
Second, consider the varying degrees of dependence that the products,
“knowledge,” “skill,” and “ability,” have on the course content. “Knowl-
edge of parts of cell” is the most closely related, even though other spe- 
cific content (i.e., cell structure of animals) could serve the same purpose 
equally well. In the case of “skill in using microscope” and “ability to write 
accurate reports of scientific observations,” a still greater variety of course con- 
tent could be used to achieve the same objectives. This discussion should by 
no means be construed as an attempt to deemphasize the importance of course 
content. Course content is extremely important. However, content serves its 
most useful purpose when viewed as a means of obtaining educational objectives
rather than as an end in itself.

Another point illustrated by the diagram is the degree to which objectives
vary in complexity. The first learning outcome, “knowledge of parts of cell,” is
specific, easily attained, and can be measured directly by a paper-and-pencil test.
The last learning outcome, “ability to write accurate reports of scientific
observations,” is rather general, cannot be attained completely in a single course,
and can be evaluated only by subjective means. Since this variation in the
complexity of learning outcomes causes considerable difficulty in the process of 
identifying and defining educational objectives, the various dimensions of objec- 
tives need to be considered before we proceed further. 


DIMENSIONS OF EDUCATIONAL OBJECTIVES 


When educational objectives are viewed as learning outcomes, they can be
described along a number of dimensions. For example, they may be described in
terms of how general or specific they are, how tangible they appear to be, or
how functional they are. Although classifying objectives in terms of these and
similar categories results in some overlap (i.e., a learning outcome can be classified
in several different ways), discussion of the various dimensions will help
clarify the process of stating them in behavioral terms. It will also point out
some of the complexities mentioned earlier.


General Versus Specific Objectives 


One way to classify educational objectives is by the degree of generality with 
which they are stated. They vary from the extremely specific “ability to add 
whole numbers” or “ability to attack new words” to the very general “ability 
to solve problems” or “ability to read with comprehension.” Specific statements 
tend to make clear exactly what behavior is expected, but also usually result 
in a long unmanageable list of unrelated educational outcomes. In contrast, too 
generalized statements convey only a fuzzy notion of what is expected. State- 
ments of educational objectives preferably should fall somewhere in between 
these two extremes. They should be general enough to organize the outcomes 
of instruction into logical categories and specific enough to indicate the be- 
havior changes expected in the pupils. 


Tangible Versus Intangible Objectives 


Another way of classifying educational objectives is in terms of their tangibility.¹
Objectives in the area of knowledge and skill are highly tangible. Who
invented the cotton gin? What is 1,200 times 20? How accurately and how
rapidly can each pupil type? Learning outcomes in these areas are easily identified
and capable of the most direct measurement. The least tangible objectives
are those in the realm of attitudes, appreciations, interests, and facets of adjustments.
What is the pupil’s reaction to a new and strange situation? What does
he choose to read in his leisure time? In between these extremes are the objectives
of intermediate tangibility, such as those dealing with understandings,
aspects of thinking, and the like. What does the number 9 mean in the number
1,976? What does Millay mean by “My candle burns at both ends . . .”?
The general tendency in the past has been to avoid the less tangible outcomes
of instruction and to concentrate on improving the methods of evaluating knowledges
and skills. However, there have been successful attempts to identify and
evaluate pupil growth in such intangible areas as social sensitivity, appreciation
of literature and art, interests, attitudes, and adjustments, and these have led to
considerable interest in the less tangible objectives of education.² This interest
has not been limited to the affective domain of behavior but has also been concerned
with such cognitive areas as understandings, applications of knowledge,
and aspects of thinking.³ Although learning outcomes in these less tangible areas
are more difficult to identify and evaluate, the importance of these objectives
seems to justify the additional effort required. Improvement in the entire educational
process is more likely to result from efforts to identify and evaluate all
important outcomes of instruction than from attempts to improve instruments
for evaluation in a highly tangible but limited area.

¹ J. Raymond Gerberich, Specimen Objective Test Items: A Guide to Achievement Test
Construction (New York: Longmans, Green, 1956).


Ultimate Versus Immediate Objectives 


An especially useful way of viewing educational objectives is in terms of the
extent to which they are ultimate or immediate.⁴ Ultimate objectives are those
concerned with typical performance of individuals in the actual situations they
will face in the future. For example, good citizenship is reflected in adult life
through voting behavior, interest in community affairs, and the like; safety
consciousness shows up in safe driving and safe work habits and in obeying
safety rules in other areas of daily activity; critical thinking is apparent in an
individual’s resistance to propaganda, his evaluation of arguments, and his
general approach to life’s problems. Although these ultimate objectives are the
important goals of education, they generally cannot be evaluated directly for
obvious reasons of complexity and the fact that most of us are teaching children
and adolescents rather than adults.

We must therefore usually be content with more immediate objectives. In
identifying the immediate objectives of instruction, however, it is possible to
move closer to the ultimate objectives than has been the practice in the past.
One way to do this is to include objectives of typical performance, wherever
possible, which are closely related to the ultimate situation.

² Eugene R. Smith, Ralph W. Tyler, and the Evaluation Staff, Appraising and Recording
Student Progress (New York: Harper & Row, 1942).
³ Benjamin S. Bloom (ed.), Taxonomy of Educational Objectives (New York: Longmans,
Green, 1956).
⁴ E. F. Lindquist (ed.), Educational Measurement (Washington, D.C.: American Council
on Education, 1951).

Evaluation in driver-training classes serves as a good illustration of how
immediate objectives can closely parallel the ultimate goal being sought. The
ultimate objective of driver training is “the operation of an automobile in a
safe manner.”⁵ The immediate objectives include “knowledge of the rules of
the road,” “knowledge of how to operate an automobile,” and “ability to operate
an automobile safely.” It should be noted that although the first two objectives
(pertaining to knowledge) are most easily and accurately evaluated, the
last objective is the one most pertinent to the ultimate goal. Consequently, an
evaluation of how safely a student actually operates the automobile, even though
it be by use of subjective devices such as a rating scale or a checklist, is more
significant to the ultimate purpose of the course than are precise measures of
the students’ knowledge.

The illustration of ultimate versus immediate objectives in driver training 
is rather obvious, since most people recognize that an individual may know 
the rules of the road and know how to operate an automobile and still not be 
able to drive an automobile skillfully or safely. Parallel situations exist in many 
other areas of evaluation, however. Knowledge of subject matter is frequently
accepted as evidence of an understanding of subject matter, even though studies
have shown a relatively low relationship between the two types of learning
outcomes.⁶ Knowledge of study skills is commonly the only objective evaluated at
the end of a unit designed to improve study habits, even though teachers all too
often must deal with the discrepancy between knowledge and application. (We
are all familiar with the student who habitually gets perfect scores on spelling
tests but misspells every fifth word in all his written assignments!) Likewise,
knowledge of safety rules is sometimes taken as evidence of a predisposition to
behave in a safe manner, knowledge of procedures for maintaining good health
as evidence of effective health habits, and knowledge of good literature as evidence
of appreciation of good literature.

Knowledge is, of course, an important outcome, the foundation for all other
learning, and an outcome which should be evaluated for its own sake. Knowledge
alone, however, cannot be used as evidence of having acquired understandings,
skills, habits, attitudes, or appreciations, since there may be little or no
relationship between the acquisition of these outcomes and the acquisition of
knowledge. The most effective procedure for assuring that all important learning
outcomes are being evaluated properly is to state the immediate objectives
of instruction in a manner to reflect most clearly the ultimate objectives to be 
achieved, and then to develop, or select, the evaluation procedures best suited 
to each of the various types of objectives. 


⁵ It is interesting to note that the lower accident rate among persons completing driver
training indicates this ultimate objective is being achieved, at least in part. If we could
obtain similar evidence concerning achievement of other ultimate objectives, both teaching
and evaluation could be improved more rapidly.
⁶ Ralph W. Tyler, “The Relation Between Recall and Higher Mental Processes,” in C. H.
Judd, Education as Cultivation of the Higher Mental Processes (New York: Macmillan,
1936).


Single-Course Versus Multiple-Course Objectives


Still another means of classifying objectives is on the basis of the number of
courses contributing to the stated goals. Some educational objectives are unique
to a particular course so that other educational experiences make little or no
direct contribution to their attainment. Knowledge of subject matter, for exam- 
ple, is usually concerned with separate and distinct content for each course. 
In fact, special efforts are usually made to prevent an overlapping of content 
from one course to another. A similar situation commonly exists for such learn- 
ing outcomes as understandings, laboratory skills, and performance skills. In 
such instances, the educational objectives are limited to those learning outcomes 
that can be derived from the specific course. 

On the other hand, a number of educational objectives are dependent on a
wide variety of courses and a number of years for their attainment. A good
illustration of this is found in the development of thinking skills. Certain courses
may well make greater contributions than others to the development of thinking
skills, but learning outcomes in this area can logically be included in the objectives
stated by every teacher. This is based on the assumption that a single

learning experience, or a single series of learning experiences, will make little 
change in a pupil’s methods of thinking. The cumulative effect, however, of 
many learning experiences over a period of years, in a variety of subject matter 
areas, is apt to make an appreciable difference in the development of thinking 
skills. Other objectives, such as those pertaining to communication skills, study 
skills, attitudes, appreciations, and the development of various character traits, 
can also be considered within the province of every teacher and, therefore, are 
multiple-course objectives. It appears: jn fact, that the more complex the objec- 
tive, the more desirable is the use of a multiple approach toward its attainment. 
The fact that some educational objectives are common to a nw cour 
requires consideration of more factors than would be the case if all objectives 
Were unique to one particular course OF another. In addition to identifying, 
Selecting, and clarifying for evaluation purposes the more obvious learning 
Outcomes of his course, the teacher must also consider the contribution his 
course can make to the more general development of the pupil. This requires a 
Knowledge of the general objectives of the school, an understanding of the 
Potentialities of his subject with regard to these objectives, and the ability to 
State the learning outcomes of his course in such a manner that they clearly 
indicate degrees of progress toward the more general objectives. 


Stated Versus Functional Objectives 

Objectives may also be classified in terms of the degree to which they are
operative in the instructional program. There is, frequently and unfortunately,
a wide discrepancy between the objectives which are stated for a course and
those which are implicit in the teaching-learning process. This discrepancy may
be due partly to a tendency to state educational objectives in such vague and
general terms that they are difficult to translate into classroom practice. Even
when that error is avoided and objectives are more clearly and specifically
stated, however, the stated objectives commonly fail to contribute to the whole
educational process because of inadequate attention to, or improper choice of,
the evaluative techniques used. An objective test of terminology and specific
facts will do more to determine the type of learning pupils engage in than any
number of impressive statements concerning the development of critical thinking,
scientific methods, and the like. This does not mean to imply that pupils
regard as significant only those experiences on which they are to be evaluated,
but “things that count” certainly have a major influence on how and what they
learn. Even college students frequently ask what will be included in examinations
and what types of questions will be used.

Recognizing and removing the discrepancy between the stated objectives for
a course and the functional objectives arising from the teaching-learning
situation is therefore extremely important if we are to improve instruction and
provide optimal learning conditions for pupils. In fact, this is one of the
primary ways in which improved evaluation procedures contribute to improved
learning. By stating educational objectives clearly, in terms of changes in pupil
behavior, and by selecting or constructing evaluation instruments that actually
evaluate these changes, it is possible to translate the stated objectives of
instruction into functional goals which guide and direct the learning of pupils. This is
one of the main purposes of a sound evaluation program and is a major thesis
of this book.


DETERMINING EDUCATIONAL OBJECTIVES 
GENERAL CONSIDERATIONS 


When teachers first approach the task of identifying educational objectives
for evaluation purposes, they are frequently confused by the seemingly endless
array of learning outcomes that might be considered and by the lack of
authoritative information concerning which objectives are most valuable for specific
purposes. There is no simple method for identifying and selecting objectives,
but a systematic approach eliminates much of the confusion and provides some
assurance that important objectives are not overlooked.

A systematic approach involves proceeding from the general to the specific.
This section, then, is devoted to discussing the general factors, and the following
section illustrates a specific procedure for relating the general factors to the
defining of objectives for a particular course.


Types of Learning Outcomes to Consider 


Although the specific learning outcomes resulting from a course of study may 
run into the hundreds, most of them can be classified under a relatively small 
number of headings. Any such classification is of necessity arbitrary, but it 
serves a number of useful purposes. It indicates types of learning outcomes that 
should be considered; it provides a framework for classifying those outcomes;
and it directs attention toward changes in pupil behavior in a variety of areas.

The following list of types of outcomes delineates the major areas in which
educational objectives might be classified. The more specific aspects of behavior
under each type should not be regarded as exclusive; they are merely
suggestive of what might be considered in that facet of education.




1. Knowledge:
1.1 Terminology.
1.2 Specific facts.
1.3 General principles.
1.4 Methods and procedures.


2. Understandings:
2.1 Ability to apply knowledge (in novel situations).
2.2 Ability to interpret cause and effect relationships.
2.3 Ability to explain methods and procedures.


3. Thinking Skills:
3.1 Ability to generalize from given data.
3.2 Ability to recognize assumptions underlying generalizations.
3.3 Ability to recognize the limitations of data.


4. General Skills:
4.1 Laboratory skills.
4.2 Performance skills (art, music, sports, and so on).
4.3 Communication skills.
4.4 Computational skills.
4.5 Work skills.
4.6 Study skills.
4.7 Social skills.


5. Attitudes:
5.1 Social attitudes.
5.2 Scientific attitudes.


6. Interests:
6.1 Personal interests.
6.2 Vocational interests.


7. Appreciations:
7.1 Critical judgment of thing appreciated.
7.2 Enjoyment of thing appreciated.


8. Adjustments:
8.1 Social adjustments.
8.2 Emotional adjustments.


Even a cursory glance at this list reveals the wide variety of learning outcomes
that can be contemplated when developing a list of objectives for a particular
course. Not every teacher, of course, will identify objectives in all of
these areas. Age level of pupils, subject matter area, and the philosophy of the
school will determine the nature of the learning outcomes to be emphasized in
a particular teacher's objectives. In general, however, we need to broaden our
concept of learning outcomes so that all logical goals of a course are included
in the final list of course objectives.


Sources of Suggestions for Educational Objectives 


A classroom teacher who desires to supplement his own ideas concerning
educational objectives will find it relatively easy to obtain suggestions.
There are numerous published lists of objectives covering all grade levels
and all subject matter areas. These various lists of objectives are generally most
helpful after a tentative list of objectives has been developed by the teacher.
At this point, they assist him in evaluating the comprehensiveness and validity
of his own list of objectives.

Two of the most comprehensive lists are found in Elementary School Objectives⁸
and Behavioral Goals of General Education in High School.⁹ [...] the National
Council of Teachers of Mathematics, the National Council for the Social Studies,
and the National Science Teachers Association [...] include suggested
statements [...].

A source of suggestions [...] the cognitive area is the Taxonomy of Educational
Objectives.¹¹ [...] (1) Knowledge, (2) Comprehension, (3) Application,
(4) Analysis, (5) Synthesis, and (6) Evaluation. Each major section is subdivided
into more specific learning outcomes appropriate to that area. The taxonomy is
especially useful for obtaining suggestions for objectives in the spheres of
comprehension, reasoning, and other areas concerned with the more complex types
of learning outcomes. Since illustrative test items are presented for each of the
educational objectives, the taxonomy is an invaluable guide to the identification
of objectives for evaluation purposes.

8 Nolan C. Kearney, Elementary School Objectives (New York: Russell Sage Foundation,
1953).

9 Will French and Associates, Behavioral Goals of General Education in High School
(New York: Russell Sage Foundation, 1957).

10 Chester Harris (ed.), Encyclopedia of Educational Research (3rd edition, New York:
Macmillan, 1960).

11 Benjamin S. Bloom (ed.), Taxonomy of Educational Objectives (New York: Longmans,
Green, 1956).


DETERMINING EDUCATIONAL OBJECTIVES 
SPECIFIC PROCEDURE 


As pointed out previously, the process of identifying and defining educational
objectives for evaluation purposes is a complex one. There is no simple, or
single, procedure which is best for all teachers. Some prefer starting with course
content, some with the general objectives of the school, and some with lists of
objectives suggested by curriculum experts in the area. Although the same end
results may accrue from various approaches to the problem, the following
procedures have been found useful to teachers in a variety of teaching fields.


Orientation


The task of stating educational objectives is simplified if we constantly keep
in mind that we are making a list of expected outcomes of teaching-learning
situations. We are not identifying subject-matter content but the reaction pupils
are to make to this content. We are not listing the learning experiences of the
course but the changes in pupils’ behavior resulting from these experiences. We
are not describing what we intend to do during instruction but are listing
the expected results of this instruction. The point of orientation, then, is
the pupil and what he is like at the end of the teaching-learning process.

Stating objectives in terms of product rather than process is easier said
than done. Most of us are so concerned with the ongoing process in the
classroom that we find it difficult to focus our attention on the resulting
changes in pupil behavior. The very nature of teaching conditions us to focus
our attention on the learning process. We can successfully shift this focus,
however, if we continually ask ourselves: What should the pupils be able to do
at the end of the course that they could not do at the beginning? As we attempt
to answer this question, always in terms of knowledge, understandings, skills,
attitudes, etc., we find that the pupils and their changes in behavior have
almost automatically become the center of focus. We are then in a much better
position to define our educational objectives in terms of product.


Identifying and Stating General Objectives


In developing a list of objectives for a course of study, we have two immediate
goals in mind. One is to obtain as complete a list of course objectives as
possible. The other is to state these objectives so that they indicate the
changes in pupil behavior expected by the end of the course. A procedure
likely to result in a comprehensive and well-stated list of objectives
is illustrated in the attempts of Mr. Brown to develop objectives for
his tenth-grade biology class. Other attacks on the problem may be equally
effective, but his procedure includes the major steps involved.

Although Mr. Brown had been teaching biology for several years, he started at
the same point where a teacher-in-training might start: with the main
purposes of the biology course he was teaching. He asked himself questions
such as: What are the main reasons for teaching this course? What should
students be like, in terms of behavior, when they complete this course? After
considerable deliberation, he concluded that at the end of the course pupils
should know certain biological facts, be able to perform certain laboratory
operations, and have a scientific attitude toward biological phenomena. Mr.
Brown recognized that these general purposes needed to be divided into more
definite statements of objectives and that other objectives might be added
later, but he felt he had achieved a good starting point.

He began his analysis of the general purposes of his course with
biological facts. His first impulse was to list all of the biological facts that
pupils should know by the end of the course. When he recalled, however, that
the pupil and his behavior is the proper point of orientation, he quickly rejected
that impulse, and asked himself another question: What will students be able
to do that will indicate they “know biological facts”? In answering this
question, he soon realized that his original concept was too narrowly conceived and
that other aspects of knowledge needed to be included. After several unproductive
attempts, he finally listed the following general objectives, properly stated
in terms of eventual pupil behavior:

• Knows common terms used in biology.
• Knows specific biological facts.
• Knows general biological principles.
• Applies biological facts and principles to new situations.


Since his list of knowledge outcomes [...] analyze the outcomes pertinent to
[...] at the end of their laboratory [...] objectives (stated in terms of pupil
behavior):


• Knows common laboratory procedures.
• Uses the microscope skillfully.
• Performs basic operations of dissection skillfully.
• Writes clear and accurate reports of laboratory experiences.


[...] go ahead with [...] also be considered one of the
outcomes of laboratory work. As he attempted to analyze this area, he realized
that he could describe behaviors which indicated a scientific attitude, but he
could not identify separate educational objectives in this area. He therefore
kept as his goal the original statement:


• Scientific attitude toward biological phenomena.


Although he considered all of the objectives so far identified by analyzing
the major purposes of the course important, Mr. Brown still felt his list
incomplete. To check for missing objectives, he analyzed the course content, topic
by topic, asking himself the same question about each topic: Why is this topic
included in the course? When he finished this analysis, he could add the
following objectives (still in terms of pupil behavior):


• Appreciates the achievements of scientists.
• Ability to interpret diagrams, charts, and graphs.


Mr. Brown’s next step was to study his classroom and laboratory procedures
to see if there were any objectives based primarily on his teaching methods.
Implicit in his assignment of library work was the hope that pupils would learn
a systematic procedure for locating information. In directing laboratory work
he frequently stressed the importance of working cooperatively on laboratory
projects. These reflections led to the identification of two more objectives:


• Ability to locate biological information.
• Ability to work cooperatively with others.


By this time, Mr. Brown had exhausted his own resources in identifying
objectives, so he turned to the experts to find out what they could offer him.
Consultation of the Encyclopedia of Educational Research¹² and recent issues
of the Review of Educational Research led to a number of lists of objectives
in biology. He found the lists a bit confusing because some included both
process and product objectives, some were concerned mainly with course
content, and all seemed to use somewhat different terminology. Despite these
shortcomings, Mr. Brown was able to determine that his own list covered the main
goals of biology, with one exception. There seemed to be an area related to
such abilities as drawing valid conclusions from data, justifying conclusions,
weighing evidence, and using other similar mental processes. These learning
outcomes were variously categorized under headings such as reasoning ability,
scientific method, critical thinking, and problem-solving ability.¹³ For his own
purposes, Mr. Brown decided to use the phrase “critical thinking” and added
to his list:


• Ability to think critically.


12 Chester Harris (ed.), Encyclopedia of Educational Research (3rd edition, New York:
Macmillan, 1960).

13 This illustrates one of the problems in stating and evaluating a list of course
objectives. The same aspects of behavior may be categorized and labeled in many different ways.


For clearness and greater usefulness, Mr. Brown then grouped his objectives
according to the major types of learning outcomes indicated by each objective.
His complete list of general objectives is shown in Table 2.1.


Table 2.1
GENERAL OBJECTIVES FOR TENTH-GRADE BIOLOGY COURSE

Types of Learning Outcomes    General Objectives

Knowledge                      1. Knows common terms used in biology
                               2. Knows specific biological facts
                               3. Knows general biological principles
                               4. Knows common laboratory procedures
Understanding                  5. Applies biological facts and principles to new
                                  situations
Thinking Skills                6. Shows ability to think critically
Laboratory Skills              7. Uses the microscope skillfully
                               8. Performs basic operations of dissection skillfully
Communication Skills           9. Writes clear and accurate reports of laboratory
                                  experiences
Study Skills                  10. Shows ability to locate biological information
                              11. Shows ability to interpret diagrams, graphs, and
                                  charts
Attitudes                     12. Has scientific attitude toward biological phenomena
Appreciation                  13. Appreciates the achievements of scientists
Adjustments                   14. Shows ability to work cooperatively with others
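
The grouping shown in Table 2.1 can also be written down as a small data structure. The following Python sketch is offered only as an illustration: the names OBJECTIVES and group_by_outcome_type are assumptions introduced here, not part of Mr. Brown's procedure, and the entries are simply copied from the table.

```python
from collections import defaultdict

# (number, type of learning outcome, general objective), copied from Table 2.1.
OBJECTIVES = [
    (1, "Knowledge", "Knows common terms used in biology"),
    (2, "Knowledge", "Knows specific biological facts"),
    (3, "Knowledge", "Knows general biological principles"),
    (4, "Knowledge", "Knows common laboratory procedures"),
    (5, "Understanding", "Applies biological facts and principles to new situations"),
    (6, "Thinking Skills", "Shows ability to think critically"),
    (7, "Laboratory Skills", "Uses the microscope skillfully"),
    (8, "Laboratory Skills", "Performs basic operations of dissection skillfully"),
    (9, "Communication Skills", "Writes clear and accurate reports of laboratory experiences"),
    (10, "Study Skills", "Shows ability to locate biological information"),
    (11, "Study Skills", "Shows ability to interpret diagrams, graphs, and charts"),
    (12, "Attitudes", "Has scientific attitude toward biological phenomena"),
    (13, "Appreciation", "Appreciates the achievements of scientists"),
    (14, "Adjustments", "Shows ability to work cooperatively with others"),
]


def group_by_outcome_type(objectives):
    """Group (number, statement) pairs under their type of learning outcome."""
    groups = defaultdict(list)
    for number, outcome_type, statement in objectives:
        groups[outcome_type].append((number, statement))
    return dict(groups)


if __name__ == "__main__":
    # Print the objectives under each heading, in the order of the table.
    for outcome_type, items in group_by_outcome_type(OBJECTIVES).items():
        print(outcome_type)
        for number, statement in items:
            print(f"  {number}. {statement}")
```

Keeping the type of learning outcome alongside each statement makes it easy to see at a glance which types are represented by several objectives and which by only one.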


[...] and how well these changes had taken place. In looking over his
objectives, he saw that, for evaluation purposes, he needed more specific
statements of learning outcomes. Thus he decided to describe each
objective in greater detail.


As he reviewed his “breakdown” of the first objective, Mr. Brown could not
help but notice that these specific learning outcomes almost told him how to
evaluate “knowledge of common terms.” They indicated precisely what a pupil
could do when he had achieved this general objective. Encouraged by this
immediate success, he continued to “break down” the other objectives in the
areas of knowledge and understanding. He encountered no difficulty in
identifying specific pupil behaviors for each general objective until he started on
“ability to think critically.” This objective sent him hurriedly back to the
professional literature for suggestions concerning types of pupil behavior which
reflect critical thinking in biology.

A search of the literature led him to a recent yearbook of the National
Society for the Study of Education, entitled Rethinking Science Education.¹⁴
Suggestions from this source, as well as from the Encyclopedia of Educational
Research and the Review of Educational Research consulted earlier, provided
Mr. Brown with a fairly comprehensive list of specific behaviors which reflected
critical thinking. He discarded some of the suggestions (e.g., “Identifies and
defines problem,” “Collects and organizes pertinent information,” and
“Formulates possible hypotheses”) because they did not appear relevant to his classroom
and laboratory procedures.¹⁵ After sifting through the remaining suggestions,
he decided that, for the purpose of his course, the critical thinking
objective should be limited to four specific behaviors.


6. Shows ability to think critically.
6.1 Distinguishes between facts and opinions.
6.2 Draws valid conclusions from given data.
6.3 Recognizes assumptions underlying conclusions.
6.4 Recognizes the limitations of given data.


We can appreciate Mr. Brown’s relief when he had finished “breaking down”
this objective. He felt that the objective of “ability to think critically” was one
of the most difficult objectives to analyze. He was a bit abashed to recall he
had frequently told parents at P.T.A. meetings that he believed science courses
developed critical thinking, when he had never before systematically determined
what biology students could do when they thought “critically”! His sense of
accomplishment was not erased by this embarrassing recollection, however,
for he felt that his analysis would not only be of aid in his evaluation of
pupil progress but would also improve his teaching in this area.

With this difficult one behind him, he continued to analyze each of the
other general classroom objectives into more specific learning outcomes. The
following samples of his work have been selected because each represents a
slightly different problem in identifying specific behaviors which indicate
attainment of the goal.

14 J. Darrell Barnard, Rethinking Science Education, Fifty-Ninth Yearbook of the
National Society for the Study of Education, Part I (Chicago, Illinois: The University of
Chicago Press, 1960).

15 It should be noted that it is not always necessary, nor desirable, to discard suggested
learning outcomes that do not fit classroom practice. It may be more desirable to change
the classroom practice. In either case, however, classroom practice and learning outcomes
must be in harmony.


10. Shows ability to locate biological information.
10.1 Uses the library card catalogue to locate references.
10.2 Knows common sources of biological information.
10.3 Uses the table of contents and index when seeking information in
books.
10.4 Determines the relevancy of information for a particular problem.


12. Has scientific attitude toward biological phenomena.
12.1 Suspends judgment until all of the facts are available.
12.2 Seeks cause-and-effect relationships in biological data.
12.3 Shows willingness to consider new interpretations of biological
data.
12.4 Makes interpretations of biological data which are free from bias.
12.5 Indicates confidence in biological data obtained by scientific
procedures.


13. Appreciates the achievements of scientists.
13.1 Indicates the main contributions of selected scientists.
13.2 Describes the influence of scientific achievements on modern life.
13.3 Reads supplementary materials regarding scientific achievements.


It should be noted that Mr. Brown did not limit himself to educational
objectives and specific behaviors which could be evaluated by written tests. [...]
important outcomes of his course and then broke them down
into those pupil behaviors [...]


Summary of Steps for Determining Educational Objectives


While we have followed the mental peregrinations of a typical teacher of biological
science in some detail, it is possible to separate his procedure from his subject
matter in order to review what he did. The resulting outline indicates the major
steps to follow in identifying and defining educational objectives for any grade
level and for any subject-matter area.


I. Identifying the general objectives.
1. Identify the general purpose of the course.
2. Analyze each general purpose of the course into definite statements of
classroom objectives.
3. [...]
4. Examine the teaching methods used, and add the classroom objectives
resulting primarily from methods of instruction.
5. Consult lists of objectives published by experts and add those
classroom objectives that are appropriate.
6. Check the list of objectives against the various types of learning
outcomes to be sure all important goals have been included.


II. Stating the general objectives.
1. State the general objectives in terms of pupil behavior.
2. Include only one objective in each statement.
3. State the objectives at the proper level of generality.
4. Group the objectives in terms of type of learning outcome indicated
by each objective.


III. Breaking down the general objectives.
1. List the specific learning outcomes which indicate the attainment of
each objective.
2. State the specific learning outcomes in terms of observable pupil
behavior.
3. Consult the professional literature for behavioral components of those
concepts which lack common meaning (e.g., critical thinking,
creativity, social sensitivity, and so on).


In breaking down the general objectives into specific pupil behaviors, it is,
of course, impossible to list all pupil behaviors which characterize the
attainment of each objective. As a guide, Travers¹⁶ suggests that enough be listed to
clarify the typical behavior of pupils who have achieved that objective. He also
points out that if an individual is unable to define an objective with specific
behaviors, the objective is meaningless as stated and should be eliminated or
revised.
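
The check attributed to Travers above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class GeneralObjective, the function undefined_objectives, and the entry "Understands biology" are hypothetical and do not come from the text; the critical thinking entry repeats Mr. Brown's four specific behaviors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneralObjective:
    """A general objective and the specific pupil behaviors that define it."""
    statement: str
    specific_outcomes: List[str] = field(default_factory=list)


def undefined_objectives(objectives):
    """Return the statements of objectives that list no specific behaviors."""
    return [obj.statement for obj in objectives if not obj.specific_outcomes]


course_objectives = [
    GeneralObjective(
        "Shows ability to think critically",
        [
            "Distinguishes between facts and opinions",
            "Draws valid conclusions from given data",
            "Recognizes assumptions underlying conclusions",
            "Recognizes the limitations of given data",
        ],
    ),
    # Hypothetical entry, not from the text: deliberately left without behaviors.
    GeneralObjective("Understands biology"),
]

if __name__ == "__main__":
    # Objectives with no specific behaviors are candidates to eliminate or revise.
    for statement in undefined_objectives(course_objectives):
        print("Eliminate or revise:", statement)
```

The sketch only detects an objective with no behaviors at all; judging whether the behaviors listed are enough to clarify typical pupil behavior remains a matter for the teacher.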


OTHER EXAMPLES OF DETERMINING OBJECTIVES


The following examples further illustrate the process of identifying and stating
educational objectives for evaluation purposes. In each example, the analysis
has been limited to only one, fairly complex, area, since it is our purpose to
illustrate the procedure for identifying and stating objectives rather than to
suggest specific objectives teachers should use.


Reading Comprehension at the Second-Grade Level


A universal objective of second-grade teachers is that pupils should
comprehend what they are able to read. Mrs. Jackson had already developed satisfactory
descriptions of pupil behaviors which display possession of the knowledge and
skills necessary for reading, but she experienced some difficulty in describing
those which are evidence of comprehension. She knew that most of her students
did comprehend what they read because they were able to produce written
answers to written questions, and some were even able to write story summaries,
but these indications were limited in two important ways. Written work
involved ability to spell and write with ease, as well as ability to comprehend,
and daily written work did not provide a systematic way of identifying
levels of comprehension. She was particularly interested, also, in developing
evaluative techniques which would clearly indicate the difference between
comprehension achieved through the printed word alone, unaccompanied by picture
clues, and comprehension achieved with the help of pictures. Standardized
achievement tests and unit tests published by the textbook publishers were not
very helpful in this respect.

Examination of the teacher's guidebook, inspection of workbook pages, and an
analysis of her teaching methods did give her some clues as to the nature and
level of second-grade reading comprehension, but she felt the need of still more
information. She decided to consult the Taxonomy of Educational Objectives¹⁷
and found there that comprehension was divided into three levels: translation,
interpretation, and extrapolation. Her only problem then was to translate these
terms into second-grade behavior terms! As she started to work, however, she
was much encouraged, for the clues she had found in the analysis of her own
materials and methods fell into place quite nicely.

Here, in outline, is her solution to the problem of identifying specific
behaviors which reflect reading comprehension at the second-grade level.

16 Robert M. W. Travers, Educational Measurement (New York: Macmillan, 1955).


Comprehension        Characterized by                  Specific Behavior

1. Translation       Ability to define, paraphrase,    1.1 Follows written or printed directions
                     demonstrate by example            1.2 Responds to written questions,
                                                           either orally or in writing
                                                       1.3 Can tell “what happened” in a
                                                           story*

2. Interpretation    Ability to explain, summarize     2.1 Can indicate sequence of events
                                                       2.2 Can choose most appropriate title
                                                           for a story
                                                       2.3 Can tell why that title is most
                                                           appropriate

3. Extrapolation     Ability to see implications,      3.1 Reacts emotionally to story
                     consequences, effects,            3.2 Can choose most probable reason
                     corollaries, make predictions         for particular action of a character
                                                       3.3 Can choose most probable ensuing
                                                           action implied by the story
                                                       3.4 Can choose most probable emotional
                                                           attitude of a particular character

* In each instance, the story is not illustrated.


17 Benjamin S. Bloom (ed.), Taxonomy of Educational Objectives (New York: Longmans,
Green, 1956).


Arithmetic Reasoning at the Fourth-Grade Level


Mr. Whiteside encountered a problem similar to Mrs. Jackson’s when he
attempted to identify specific objectives which would exemplify arithmetic
reasoning on a fourth-grade level. He recognized that ability to reason
arithmetically depends upon knowledge of arithmetic facts, knowledge of the number
system, and familiarity with the variety and frequency of the use of quantitative
statements and processes in a number of other subject-matter areas. It was also
clear to him that reasoning goes beyond understanding and can rightfully be
considered a thinking skill. Since “arithmetic reasoning” so often is used to
designate everything except computation, however, Mr. Whiteside resolved to
scrutinize especially those behaviors which typify understanding and those
which typify reasoning with the purpose of distinguishing between the two. He
realized that both are developed slowly and in conjunction with increasing
knowledge, and that he must keep his objectives in tune with the amount of
knowledge expected. He therefore limited his general objectives, typical of
understanding, to the following, with specific learning outcomes for each, as
shown:


1. Recognizes relationships between arithmetic processes.
1.1 Can illustrate relationship between addition and multiplication.
1.2 Can illustrate relationship between subtraction and division.
1.3 Can solve problems by more than one process.


2. Understands place values of notation.
2.1 Makes use of place values as guides in estimating.


3. Recognizes quantitative meanings illustrated in diagrams, graphs, and
charts.
3.1 Can identify main point illustrated in diagrams, graphs, and charts.
3.2 Can construct simple diagrams, graphs, or charts to illustrate
quantitative concept.


4. Uses “language” of arithmetic.
4.1 Can translate arithmetic expressions to verbal statements.
4.2 Can translate simple verbal statements to arithmetic expressions.
4.3 Uses quantitative terms accurately when appropriate in other
subject areas.


After considerable thought, Mr. Whiteside concluded that the traditional
“ability to solve story problems” was quite applicable as a statement identifying
his general objective under arithmetic reasoning and went on to ask: What
do pupils who can solve story problems do to achieve their solutions? He thought
of four specific behaviors:


1.1 Can identify the problem (what is unknown?).
1.2 Can identify the relevant known facts.


1.3 Can identify appropriate process which relates known to the unknown.
1.4 Can solve quantitative problems by using steps 1, 2, and 3.


Since arithmetic reasoning contributes to reasoning in general, and since
both are obviously multiple-course objectives, Mr. Whiteside was not yet
satisfied that he had identified all of the pertinent specific behaviors. However, he
felt that he had made a good start in arithmetic, and that other learning
outcomes in the area of reasoning would become more apparent as he developed
objectives in other subject-matter areas.


Effectiveness of Communication at the Eighth-Grade Level 


An eighth-grade language arts teacher, Mrs. Parsons, had a somewhat
different situation confronting her. As she was making a list of objectives for
composition, she found herself entangled in the varying complexities of the number
of different skills which are used in organizing and presenting a composition.
She knew that eighth graders often are not yet able to pursue detailed analyses
of sentence structures, but that most are capable of using language in
organizing their ideas for presentation. She wanted her objectives to be applicable to
all kinds of verbal compositions; she wanted to include both written and oral
composition; and she wanted to include those behaviors which, if absent, would
detract from the effectiveness of any communication even though they do not
contribute directly to skillful compositional structure.

Rather than classifying objectives by degree of complexity, she decided to
list all the important objectives relating to all forms of composition and to select,
for evaluation, those which applied to the particular case at hand. She would
not consider the ability to outline, for instance, in evaluating a short [...].
Since the point of composition is communication, Mrs. Parsons began by
stating her general goal as that of “communicating ideas clearly and concisely.”
She then listed her objectives and more specific learning outcomes as follows:


1. Uses language skillfully.
1.1 Uses correct grammatical forms.
1.2 Uses correct punctuation and capitalization.
1.3 Uses words recently added to vocabulary.
1.4 Uses complete sentences.
1.5 Varies type, length, and structure of sentences.
1.6 In written composition:
1.61 Spells correctly.
1.62 Writes legibly.
1.63 Prepares written work which is neat and attractive to the eye.
1.7 In oral composition:
1.71 Uses correct pronunciations.
1.72 Speaks clearly.
1.73 Uses inflection to emphasize and “point up” ideas.
1.74 Presents a neat and clean appearance.

2. Organizes material logically for unity and coherence.
2.1 Outlines clearly, showing “beginning, middle, and end.”
2.2 Follows outline in preparing finished composition.
2.3 Develops topic sentences into paragraphs.
2.4 Limits paragraphs to one central idea.
2.5 Limits composition to issues and ideas relevant to main theme.


Social Understanding at the High School Level


Mr. Morris, who teaches social studies at the high school level, assigned him-
self a task he knew would be difficult to complete. As a matter of fact, he
realized he could never regard the task as finally complete, for the nature of
it invited continuing refinement and continuing modifications in his teaching
methods. For some time he had been concerned with a particular aspect of learning
which was common to all his courses. He was interested in the contribution his
classes could make toward developing the social understanding he hoped his
pupils would ultimately achieve. A search through the literature and discussions
with several other social studies teachers led to the following tentative list of
general objectives and specific learning outcomes:


1. Shows awareness of social phenomena.
1.1 Identifies common social problems and their related issues.
1.2 Recognizes universality of human needs as motivating forces behind
social movements.
1.3 Recognizes the roles of social institutions in society.
1.4 Interprets the behavior of individuals in light of social phenomena.

2. Analyzes social problems, issues, and phenomena objectively.
2.1 Identifies the significant aspects in social problems.
2.2 Distinguishes facts from opinions.
2.3 Recognizes bias, prejudice, and other distortions in social statements.
2.4 Discriminates between issues in terms of their relevance to particular
social problems.
2.5 Identifies cause-and-effect relationships in social data.

3. Shows interest in solving social problems and resolving social issues.
3.1 Frequently points out the need for social action.
3.2 Obtains information regarding social problems from a variety of
source materials.
3.3 Presents sound theoretical solutions to social problems.
3.4 Participates in school groups engaged in social action.

4. Maintains a scientific attitude toward social phenomena.
4.1 Suspends judgment until the information is complete enough to per-
mit conclusions.
4.2 Carefully considers ideas contrary to his own.
4.3 Revises conclusions when additional reliable information is obtained.
4.4 Seeks cause-and-effect relationships in social data.
4.5 Evaluates all ideas, opinions, and conclusions in terms of the authori-
tativeness of the information supporting them.


Although Mr. Morris was far from satisfied with his completed list of objec-
tives and learning outcomes, he felt that he had made great progress in trans-
lating a previously intangible goal into identifiable pupil behaviors. With this
list as a guide, he was certain that his teaching in this area would be more
effective and that evaluation of these specific learning outcomes would lead to
further modification and refinement of the social understanding objectives.


APPRAISING THE FINAL LIST OF OBJECTIVES 


Throughout this chapter, we have emphasized the role of the classroom 
teacher in the process of defining educational objectives. We have deliberately 


avoided discussions concerning which objectives should receive priority at
various grade lev 


school boards, administrators, curriculum committees, and individual teachers. 
Our aim has b and state educational objectives so 
d evaluation purposes. 

r a particular course, however, the 
termining the adequacy of his final 
questions will serve as criteria for this purpose: 


the pupils are also easily overlooked, 


A safe procedure in appraising a final 1 
by curricul 




One difficulty of applying this criterion is that the goals of the school
are often not explicitly stated and therefore must be inferred from the course of
study and the educational practices in the school. Nevertheless, the teacher must
make some judgment concerning the appropriateness of his objectives as teach-
ing goals for his particular school.


3. Are the objectives in harmony with sound principles of learning?
Since objectives indicate the desired outcomes of a series of learning experi-
ences, they should be consistent with sound principles of learning. That is, they
should (1) be appropriate to the age level and experiential background of the
pupils (principle of readiness), (2) be related to the needs and interests of the
pupils (principle of motivation), (3) reflect learning outcomes which are most
permanent (principle of retention), and (4) include learning outcomes which
are most generally applicable to various specific situations (principle of trans-
fer). A thorough knowledge of child and adolescent development and the psychology of
learning is needed to apply such criteria effectively. It can be pointed out here,
however, that understandings, thinking skills, applications of knowledge, and
other complex learning outcomes tend to be retained longer and to have greater
transfer value than the more simple learning outcomes such as knowledge of
specific facts. Consequently, special efforts should be made to include such
learning outcomes in the final list of objectives.


4. Are the objectives realistic in terms of the abilities of the pupils and the
time and facilities available?
First attempts at identifying objectives for a particular course frequently
result in an impressive but unattainable list of goals. Thus, the final list of
objectives should be reviewed in light of the abilities of the group members,
the time available for achieving the objectives, and the adequacy of the facili-
ties and equipment available. It is usually better to have several clearly defined
attainable objectives than a long list of nonfunctional goals.
This should not discourage the inclusion of multiple-course objectives. Al-
though such objectives are not completely attainable in a particular course,
realistic degrees of progress toward their attainment can be indicated.

5. Are the objectives stated clearly in terms of changes in pupil behavior?
Although this criterion has been stressed in earlier sections of this chapter,
it is included here so that it will not be overlooked in appraising the final list
of objectives. The objectives should be stated as expected learning outcomes,
clearly indicating what the pupil is like who has satisfactorily completed the
learning experience. They should not include what the teacher is going to do,
the subject-matter content to be used, or other aspects of the teaching process.
Rather, the objectives should provide a precise description of the learning prod-
uct in terms of desired changes in pupil behavior.


SUMMARY


Educational objectives make clear what learning outcomes we expect from
our teaching. They are our teaching goals expressed in terms of the desired




eH ‘ he 
and single-course objectives, and N š 
instruction in such a manner that they 
become functional goals which guide and direct the learning of pupils. 


i a in terms 
ives are viewed as learning outcomes, stated in 


area of understandings, thinking skills, performance skills, attitudes, interests,
appreciations, and adjustments should also be considered. Suggested
objectives in these and other areas may be obtained from the professional litera-
ture. However, each teacher should develop his own specific list of objectives
which takes into account the unique features of the school and the community
in which he is teaching.


The procedure for determining objectives for a particular course includes the
following steps: (1) identifying the general objectives by analyzing the
e, the teaching methods used, and lists of obj 
experts; (2) 


ng Sect dor 
list of objectives can be appraise 
hich it ical goals of the course, 
> (3) is in harmony with 
he abilities of the pupil 
A (5) clearly indicates the expecte 
changes in Pupil behavior, 


SUGGESTIONS FOR FURTHER READING 


Evaluating Elementary School Pupils- 
pter 3: “Educational Objectives A 
- Evaluation in Higher Education, Boston: Houghton Mifflin, 
+ and Dressel, P. L... “The Objective. 
Instruments, New York: I 
valuate.” Chapter 3: 


- Š. Bloom, and B. B. 


2. Ë 
in Evaluation. 


s of Instruction.” 
“ongmans, Green, 1958. Chap- 
“Defining the Behavior,” 2 
Masia. 4 Taxonomy of Educational Objectives: 
w York: David McKay, 1964. 


Mager, R. F. Preparing Objectives for Programmed Instruction. San Francisco: Fearon
Publishers, 1962. A fifty-nine-page self-instructional text designed to help teachers de-
velop skill in stating instructional objectives in behavioral terms. Broader than the title
indicates.

Thomas, R. M. Judging Student Progress. New York: Longmans, Green, 1960. Chapter 2:
“Stating Goals.”

Schwartz, A., and S. C. Tiedeman. Evaluating Student Progress in the Secondary School. 
New York: Longmans, Green, 1957. Chapter 3: “Identifying Educational Outcomes.” 
Chapter 4: “Determination of Classroom Objectives.” 


Chapter 3

relating evaluation procedures to objectives


Educational objectives encompass a variety of learning outcomes. . . .
Evaluation includes a variety of procedures. . . . The key to effective evalua-
tion is to relate the evaluation procedures as directly as possible to the
specific learning outcomes being evaluated.


By now it should be clear that evaluation is an integral part of the teaching-
learning process. It is not something tacked on at the end of a course; it is not
limited to the measurement of the amount of factual material retained; it is
not limited to paper-and-pencil examinations. Evaluation is a continuous, com-
prehensive process which utilizes a variety of procedures and which is ines-
capably related to the objectives of the educational program.


In the last chapter, we were concerned with defining objectives
for evaluation purposes. This included identifying the general objectives and
then breaking these objectives down into more specific learning outcomes. The
final step in the evaluation process is to select or develop evaluation techniques
which provide the most direct evidence concerning the attainment of each specific
learning outcome.

The following sequence of steps summarizes this general procedure for relating
evaluation techniques to objectives:




GENERAL OBJECTIVES
(Goals which direct our teaching)

↓

SPECIFIC LEARNING OUTCOMES
(Pupil behaviors we are willing to accept as evidence of the attainment of objectives)

↓

EVALUATION TECHNIQUES
(Procedures for obtaining samples of pupil behavior described in the specific learning
outcomes)


These procedural steps clarify the importance of relating the evaluation tech- 
niques directly to the specific learning outcomes being evaluated. This is the 
only way we can have any certainty that we are evaluating pupil progress toward 
the objectives we have selected as our teaching goals. 

The process of relating evaluation techniques to specific learning outcomes is 
essentially one of logical analysis and judgment. This process can be greatly 
facilitated, however, by the use of some systematic evaluation plan. 


GENERAL EVALUATION PLAN 


Whether a teacher is deciding on evaluation procedures for a unit of work,
a semester’s work, or a sequence of courses, some general evaluation plan is
desirable. As a minimum, this plan should include a list of the desired learning
outcomes and the techniques to be used in evaluating progress toward them.
The following chart, based on several of the objectives developed by Mr. Brown,
our tenth-grade biology teacher, illustrates the procedure for developing a gen-
eral plan. The numbering system is that used by Mr. Brown (in the last chap-
ter) and helps identify each objective in his original list. The complete evaluation
chart would, of course, include all of the objectives and specific learning out-
comes identified by Mr. Brown.


Objectives and Specific Learning Outcomes           Evaluation Technique

1. Pupil knows common terms used in biology         (Evaluation technique applies to
   when he:                                         learning outcome with same number.)
   1.1 Defines common terms.                        1.1 Short-answer essay test.
   1.2 Differentiates between common terms on       1.2 Objective test.
       basis of meaning.
   1.3 Recognizes the meaning of common terms       1.3 Objective test.
       when used in context.

6. Pupil shows ability to think critically when he:
   6.1 Distinguishes between facts and opinions.    6.1 Objective test.
   6.2 Draws valid conclusions from given data.     6.2 Short-answer essay test.
   6.3 Recognizes assumptions underlying            6.3 Objective test.
       conclusions.
   6.4 Recognizes the limitations of given data.    6.4 Objective test.

8. Pupil performs basic operations of dissection
   skillfully when he:
   8.1 Places the specimen in the proper position   8.1 Checklist or rating scale.
       for dissection.
   8.2 Cuts skillfully without damaging the         8.2 Checklist or rating scale.
       structure to be studied.
   8.3 Separates the structural parts of the        8.3 Checklist or rating scale.
       specimen without damaging them.
   8.4 Completes dissection in allotted time.       8.4 Checklist or rating scale.

10. Pupil shows ability to locate biological
    information when he:
    10.1 Uses the library card to locate            10.1 Research report. Observation.
         references.
    10.2 Knows common sources of biological         10.2 Objective test.
         information.
    10.3 Uses the table of contents and index       10.3 Observation.
         when seeking information in books.
    10.4 Determines the relevancy of                10.4 Research report. Observation.
         information for a particular problem.

12. Pupil has a scientific attitude toward
    biological phenomena when he:
    12.1 Suspends judgment until all of the         12.1 Anecdotal records. Objective test.
         facts are available.
    12.2 Seeks cause-and-effect relationships       12.2 Anecdotal records. Objective test.
         in biological data.
    12.3 Shows willingness to consider new          12.3 Anecdotal records. Objective test.
         interpretations of biological data.
    12.4 Makes interpretations of biological        12.4 Anecdotal records. Essay test.
         data which are free from bias.
    12.5 Indicates confidence in biological         12.5 Anecdotal records. Objective test.
         data obtained by scientific procedures.


Mr. Brown’s chart for a general evaluation plan clarifies a number of impor-
tant points concerning the relationship between educational objectives and
evaluation procedures. For one thing, it makes clear the fact that the specific
learning outcomes, stated in terms of pupil behavior, are so numerous and
varied that no single evaluation technique could possibly provide adequate
evidence concerning the achievement of these educational objectives. Although
objective tests are indicated for many of the learning outcomes, checklists,
anecdotal records, and other observational techniques are also frequently men-
tioned. The chart also highlights the importance of a clear statement of the
objectives and learning outcomes in selecting the evaluation technique. In fact,
when the learning outcomes are clearly stated in terms of pupil behavior, they
not only indicate what is to be evaluated but they also suggest how to evaluate.
For example, the item “1.1 Defines common terms” makes it clear what type
of evaluation technique should be used. It indicates that the pupil must provide
the definitions himself. Therefore, the short-answer essay, in which the pupil is
given selected terms and asked to define them, is the most adequate technique
of evaluation. An objective test item, such as multiple choice, where the pupil
must merely recognize the definition, would be inadequate for this
learning outcome, as stated. Of course, the specific learning outcome could be
restated to read “Recognizes the meaning of common terms” so that objective
test items could be used. However, this would be a change in the specific be-
havior Mr. Brown is willing to accept as evidence that the pupil knows the com-
mon terms used in biology. If he believes that knowing terms requires that a
pupil be able to define the terms in his own words, the only adequate procedure
is to ask the pupil to define the terms. The ability to recognize the
definition could not be accepted as proof of the pupil’s ability to provide
the definition unless there was research evidence to indicate that the two
types of behavior were highly related. Lacking such evidence, the only safe
procedure is to select the evaluation technique which appraises the specific
learning outcome most directly.
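
A general evaluation plan of this kind can also be kept as a simple lookup from each specific learning outcome to its evaluation techniques. The following is only a minimal sketch in Python, not a procedure from Mr. Brown's plan: the entries are a few rows copied from his chart, while the names EVALUATION_PLAN, outcomes_without_technique, and technique_counts are hypothetical and chosen for illustration.

```python
from collections import Counter

# A few rows copied from Mr. Brown's chart: outcome -> evaluation techniques.
# The dictionary name and layout are illustrative assumptions.
EVALUATION_PLAN = {
    "1.1 Defines common terms": ["short-answer essay test"],
    "1.2 Differentiates between common terms on basis of meaning": ["objective test"],
    "6.2 Draws valid conclusions from given data": ["short-answer essay test"],
    "8.1 Places the specimen in the proper position for dissection": [
        "checklist or rating scale"
    ],
    "12.1 Suspends judgment until all of the facts are available": [
        "anecdotal records",
        "objective test",
    ],
}


def outcomes_without_technique(plan):
    """Return the outcomes that have no evaluation technique assigned."""
    return [outcome for outcome, techniques in plan.items() if not techniques]


def technique_counts(plan):
    """Count how many outcomes each technique is expected to appraise."""
    return Counter(t for techniques in plan.values() for t in techniques)


if __name__ == "__main__":
    print(outcomes_without_technique(EVALUATION_PLAN))
    for technique, count in technique_counts(EVALUATION_PLAN).items():
        print(technique, count)
```

A listing like this makes two of the chart's points easy to check: every outcome has at least one technique, and no single technique covers all of the outcomes.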
Although our discussion has been focused on one specific learning outcome,
the general principle of appraising each learning outcome as directly as possible
is one that characterizes the entire chart. For example, “6.1 Distinguishes be-
tween facts and opinions” can be evaluated by objective tests. It is simply a
matter of presenting the pupil with a number of statements and asking him to
indicate which are facts and which are opinions. On the other hand, however,
“6.2 Draws valid conclusions from given data” requires a short-answer essay
because the outcome indicates that the pupil will draw his own conclusions
and does not merely recognize conclusions drawn by others. Similarly, all of the
outcomes pertinent to “8. Pupil performs basic operations of dissection skill-
fully” must be evaluated by some observation device such as a checklist or rating
scale. Knowledge of dissection procedure cannot be accepted as evidence of dis-
section skill. Knowledge of procedure can and should be measured for its own
sake but skill can be evaluated only by directly observing and judging the
pupil’s dissection procedure and the resulting product. In the area of scientific
attitude, such learning outcomes as “12.1 Suspends judgment until all of the
facts are available” require more than one type of evidence because of the
difficulty of the evaluation. Anecdotal records based on daily observation in the
classroom and laboratory will provide evidence concerning the pupil’s typical
behavior in dealing with scientific problems. Because of the lack of opportunity
to observe all pupils in situations requiring this behavior and the
subjective nature of such observations, it is also desirable to use objective test
items. Such test items merely supplement the anecdotal records, however, since
responses to objective test items do not indicate how the pupil will typically
behave when confronted with problems of a scientific nature. Used alone, both
methods are inadequate but together they complement each other and provide
more adequate evidence than either would alone. For each specific objective,
then, the evaluation chart indicates the evaluation techniques which provide the
most direct and adequate evidence concerning the extent to which the pupil’s
behavior corresponds to the desired learning outcomes.

An evaluation chart, such as Mr. Brown’s, also makes clear the necessity for
planning the evaluation program at the beginning of the unit, or course, of
instruction. If evaluative data are to be obtained by means of anecdotal records,
rating scales, or other observational devices, the nature of the observations
must be determined early in the instructional process. Ideally, the planning for
evaluation should occur at the same time as other plans are made for the course.
When this is done, teachers sometimes include the objectives of instruction, the
methods of instruction, and the evaluation techniques all together in one plan.
The following chart illustrates a simplified version of a plan for Mr. Whiteside’s
fourth-grade objective in arithmetic reasoning:


                                                                   Evaluation
Objectives                          Teaching Methods               Techniques

1. Pupil demonstrates arithmetic
   reasoning ability when he:
   1.1 Can identify the problem     Present the pupils with a      Observation and
       (what is unknown).           variety of story problems      anecdotal records.
   1.2 Can identify the relevant    which contain more facts       Objective tests.
       known facts.                 than are needed so that
   1.3 Can identify the arithmetic  they obtain practice in
       process which relates        identifying the problem,
       known to unknown.            selecting the needed facts,
   1.4 Can solve quantitative       and selecting the arithmetic
       problems using the above     process, as well as
       steps.                       computing the answers.


dure, however, one must be careful not to try to
too closely to specific objectives. One method (e.g
objectives, such as knowledge, under-
adjustment. Similarly, one objective (e.g., appreciation) may be the end result
of a series of learning experiences requiring a multitude of methods. Within
this limitation, a plan such as that
add general direction to both the teaching and
the evaluation process.


of learning outcomes 
objectives should be 


Paper-and-pencil test; 
outcomes should be 


1R. M. Thomas, Judgi 




test and which must be evaluated by other techniques. Once we have
identified the objectives and learning outcomes that can be measured by paper-
and-pencil tests, our main task is to develop a test which adequately measures
the desired changes in pupil behavior. This task can be greatly facilitated by
means of a table of specifications.


Table of Specifications 


A table of specifications is a two-way chart which relates the desired learning
outcomes to the course content used to bring about these behavioral changes.
The use of such a chart serves the test maker in much the same manner that a
blueprint serves the carpenter. It assists him in obtaining a finished product
with the desired characteristics. In the case of the test maker, it provides greater
assurance that his test will measure the learning outcomes and course content
in a balanced manner.

An example of a table of specifications for third-grade social studies is pre-
sented in Table 3.1. Note that the major areas of content are listed down the
left side of the table and the objectives are listed across the top. The numbers in
each of the cells in the table indicate the percentage of test items to be devoted
to each area of content and each type of objective. For example, with regard
to the content area of “food,” 2 per cent of the items in the test would be con-
cerned with “knowledge of common terms,” 6 per cent would be concerned
with “knowledge of specific facts,” and 2 per cent would be concerned with


Table 3.1

TABLE OF SPECIFICATIONS FOR A THIRD-GRADE SOCIAL STUDIES TEST (IN PERCENTAGE)


Objectives 
| understands Applies Interprets 
Knows Knows Principles Principles Charts 
Content Area riley p Ki J pe 1 ae | hal Total 
— 2 6 2 ji 10 
= 2 6 i 2 10 
Transportation | 4 2 2 | 2 5 15 
Communications 4 2 2 2 5 15 
Shelter P Ç I = 
a i T 4 2 6 | 8 20 
Farm Life ] 4 ° s a = 
Total 20 20 25 25 10 100




“understands principles and generalizations.” The two empty cells in the “food”
row indicate that there would be no items allotted to these areas.

The relative weight given to each area of content and each type of objective,
in a table of specifications, is indicated by the total percentage of items devoted
to each. For example, the right-hand column of Table 3.1 shows that 10 per cent
of the items are to be concerned with the topic of “food,” 10 per cent with
“clothing,” 15 per cent with “transportation,” and so on down the column.
Similarly, the bottom row of the table shows that 20 per cent of the items are
to be devoted to “knowledge of common terms,” 20 per cent to “knowledge of
specific facts,” and so on across the row. These relative weights must, of course,
be decided before the table is constructed.

In assigning weights to the various areas of content, a common procedure is
it. The weights assigned 
teacher attaches to that 
en to it in his teaching. Thus, if he stresses 
of principles and generalizations, he et 
gly greater weight. In short, the areas ° 
e of specifications, should be assigne! 


Some teachers prefer to include in their table of specifications all of the ob-
jectives of instruction, even though some of them cannot be measured by paper-
and-pencil tests. The table then becomes a general evaluation plan which includes
the specifications for the test as well as suggestions concerning methods of evalu-
ation in other areas. A simplified version of such a table, for a weather unit in
junior high school science, is presented in Table 3.2.
bjectives in a table of specifications provides
what is not, being measured by classroom tests.
le of testing in the total evaluation process and,
overemphasis on testing procedures. Each evaluation procedure is seen in its
proper perspective and its role in the total evaluation program is readily
perceived.
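
The arithmetic behind a table of specifications is simple enough to sketch in a few lines of Python. The sketch below uses only the “food” row of Table 3.1 as described in the text (2, 6, and 2 per cent); the 50-item test length and the names FOOD_ROW, content_weight, and items_per_cell are assumptions made for illustration and do not come from the table itself.

```python
# Percentage of test items for each objective in the "food" row of Table 3.1,
# as described in the text. Objectives not listed receive no items.
FOOD_ROW = {
    "knows common terms": 2,
    "knows specific facts": 6,
    "understands principles and generalizations": 2,
}


def content_weight(row):
    """Relative weight of a content area: the sum of its cell percentages."""
    return sum(row.values())


def items_per_cell(row, test_length):
    """Convert cell percentages into whole numbers of test items.

    test_length is a hypothetical total number of items. Rounded cell
    counts may not add up exactly for every test length, so the totals
    should be checked by hand.
    """
    return {
        objective: round(percent * test_length / 100)
        for objective, percent in row.items()
    }


if __name__ == "__main__":
    print(content_weight(FOOD_ROW))      # weight of "food" in the whole test
    print(items_per_cell(FOOD_ROW, 50))  # items per cell on a 50-item test
```

On a 50-item test, for instance, a cell weighted at 6 per cent works out to three items, which is the kind of decision the table is meant to settle before any items are written.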


RELATING TEST ITEMS TO SPECIFIC 
LEARNING OUTCOMES 


the first step in relating testing procedures 


in relati 
s called forth 
learning outcomes? Althou 


answered as positively as we would like, we can make special efforts to


Table 3.2

TABLE OF SPECIFICATIONS FOR A WEATHER UNIT IN JUNIOR HIGH SCHOOL SCIENCE

[Table not reproduced. Content areas: air pressure, wind, temperature, humidity
and precipitation, clouds. Objectives: knows symbols and terms; knows specific
facts; understands influence of each factor on weather formation; skill in use of
measuring devices; skill in constructing weather maps; skill in interpreting
weather maps.]


develop test items toward that end. The following examples, from various sub-
ject-matter areas, reflect considerable success in this regard. In each example,
note how the specific learning outcome gives a precise description of the be-
havior the pupil is to exhibit and how the test item presents a task which calls
forth that behavior.


EXAMPLES 


Specific Learning Outcome: Defines common terms. (Elementary Mathematics.)
Directions: In a sentence or two, define each of the following words.
1. Interest
2. Premium
3. Dividend
4. Collateral
5. Profit


Specific Learning Outcome: Knows procedure for converting from one measure to another.
(Elementary Mathematics.)

1. The area of a rug is given in square yards. How would you determine the number of
square feet?
A multiply by 3
(B) multiply by 9
C divide by 3
D divide by 9
2. The amount of milk a family drinks in one month is expressed in pints. How would
you change it to gallons?
A multiply by 4
B multiply by 8
C divide by 4
(D) divide by 8
3. The air space in a room is expressed in terms of cubic feet. How would you change it
to cubic yards?
A multiply by 9
B multiply by 27
C divide by 9
(D) divide by 27

+ . " " i ions. 
Specific Learning Outcome: Differentiates between relative values expressed in fractio 
(Elementary Mathematics.) 


1. Which one of the following fractions is smaller than one half? 
A 2/4 
B 4/6 
© 3/8 
D 9/16 
2. Which one of the 
@ 2⁄3 
B 4/7 
C 5/9 
D 9/16 
3. Which one of the following fractions has the same value as one fifth? 
A 2/20 ' 
B 5/50 
C 25/75 
® 20/100 


following fractions indicates the greatest value? 


| 


Specific Learning Outcome: Distinguishes fact from opinion. (Elementary Social Studies.)
Directions: Read each of the following statements carefully. If you think the statement is
a fact, circle the “F.” If you think the statement is an opinion, circle the “O.”

(F) O 1. George Washington was the first President of the United States.
F (O) 2. Abraham Lincoln was our greatest President.
(F) O 3. Franklin D. Roosevelt was the only President elected to that office three times.
(F) O 4. Alaska is the biggest state in the United States.
F (O) 5. Hawaii is the most beautiful state in the United States.

Specific Learning Outcome: Knows common uses of weather instruments. (Elementary
Science.)

1. Which one of the following instruments is used to determine the speed of the wind?
A Wind vane
(B) Anemometer
C Altimeter
D Radar
2. Which one of the following instruments is used to determine the amount of moisture
in the air?
A Altimeter
B Barometer
(C) Hygrometer
D Radiosonde

Specific Learning Outcome: Identifies cause-and-effect relationships. (Elementary Science.)
Directions: In each of the following statements, both parts of the statement are true. You
are to decide if the second part explains why the first part is true. If it does, circle
the yes. If it does not, circle the no.

Examples:
(Yes) No 1. People can see because they have eyes.
Yes (No) 2. People can walk because they have arms.

In the first example, the second part of the statement explains why “people can see” so
the “yes” was circled. In the second example, the second part of the statement does
not explain why “people can walk” so the “no” was circled. Read each of the following
statements and answer the same way.

Yes (No) 1. Some desert snakes are hatched from eggs because the weather is hot in the
desert.
(Yes) No 2. Spiders are very useful because they eat harmful insects.
(Yes) No 3. Some plants do not need sunlight because they get their food from
other plants.
Yes (No) 4. Water in the ocean evaporates because it contains salt.
(Yes) No 5. Fish can get oxygen from the water because they have gills.


e Learning Outcome: Applies scie 
-© Which one of the following best exp 
on a bright, sunny day? 
A Transpiration 
B Plasmolysis 
© Photosynthesis 


D Osmosi 
2. u sis 
Which one of the following best explains why bre: 
room? 


© Some plants do not produce their own food. 
B Photosynthesis can take place in the dark. 


ntific concepts and p 
Jains why green al 


rinciples. (Biology.) 
gae give off bubbles of oxygen 


ad mold can be grown in a dark 


54 The Evaluation Process 


C Chlorophyll aids the growth of plants in darkness. 
D Bread mold takes in carbon dioxide and gives off oxygen in both darkness and light.


Specific Learning Outcome: Recognizes the relevance of arguments. (Social Studies.)
Directions: The items in this part of the test are to be based on the following resolution:

RESOLVED: The legal voting age in the United States should be lowered to eighteen.

Some of the following statements are arguments for the resolution, some are arguments against it, and some are neither for nor against the resolution. Read each of the following statements and circle:
F if it is an argument for the resolution.
A if it is an argument against the resolution.
N if it is neither for nor against the resolution.

F A N 1. Most persons are physically, emotionally, and intellectually mature by the age of eighteen.
F A N 2. Many persons are still in school at the age of eighteen.
F A N 3. In most states it is legal to drive an automobile by the age of eighteen.
F A N 4. The ability to vote intelligently increases with age.
F A N 5. The number of eighteen-year-old citizens in the United States is increasing each year.


[...] test items should be related [...]. Although all subject-matter areas and all types of test items are not represented, the basic principle is the same. State the [...] in behavioral terms and select or develop test items [...] specific behavior.


RELATING NONTESTING PROCEDURES
TO SPECIFIC LEARNING OUTCOMES

There are many areas in which testing procedures are not useful. In evaluating some performance skills (e.g., singing, dancing, speaking), it is necessary to observe the pupil as he performs and to make judgments concerning the effectiveness of his performance. In other instances, it is possible to evaluate a pupil [...] quality of the product resulting from his performance (e.g., [...] a typed letter, a baked cake, and so on). In evaluating [...] it may be necessary to observe the pupil in formal and informal situations [...] and the specific learning outcomes. [...] rating scales or checklists, the specific learning outcomes become the dimensions of behavior [...]. In the following examples, note how the specific learning outcomes require only a slight modification to become items in a rating scale.

Relating Evaluation Procedures to Objectives 55

Speech

Specific Learning Outcome: Maintains good eye contact with audience.

Rating Scale Item

How effective is the speaker in maintaining eye contact with the audience?

1             2               3          4               5
Ineffective   Below average   Average    Above average   Very effective

Theme Writing

Specific Learning Outcome: Organizes ideas in a coherent manner.

Rating Scale Item

Organization of ideas.

1               2        3               4        5
Poor                     Fair                     Clear, coherent
organization             organization             organization

Group Work

Specific Learning Outcome: Contributes worthwhile ideas to group discussion.

Rating Scale Item

How often does the pupil contribute worthwhile ideas to group discussion?

1        2         3              4              5
Never    Seldom    Occasionally   Fairly often   Frequently

Complete rating scales and checklists are presented in later chapters. It is our purpose here to merely illustrate how nontesting procedures can be related to the specific learning outcomes we wish to evaluate. The specific learning outcomes specify the behavior to be observed and the rating scale provides a convenient method of recording our judgments. Such judgments are, of course, still subjective, but we have made them as objective as possible by clearly defining the pupil behavior we wished to observe and then deliberately observing those behaviors in pupils.

RELATING STANDARDIZED TESTS
TO LOCAL OBJECTIVES

The importance of relating evaluation techniques as directly as possible to the objectives and specific learning outcomes to be measured is not limited to teach[...]. This type of relevance is also a major consideration when selecting standardized achievement tests for instructional purposes. Ideally, a standardized test should measure the subject-matter content and the behavioral changes which have been emphasized in the instructional program. The degree to which a test meets this ideal can be determined only by a careful and systematic examination of the test.

In judging the relevance of a standardized test to the instructional program, it is desirable to analyze the test item by item. As each item is studied, a tabulation should be made of the subject-matter content and the behavioral change it seems to measure. This tabulation can later be compared to the areas emphasized in the instructional program to determine the degree to which the test's coverage and emphasis are adequate. If a table of specifications has been prepared for the course, the test analysis can be compared directly to the table.

We seldom expect to find a standardized test in perfect agreement with the objectives and subject-matter content emphasized in a particular course or curriculum. However, an analysis of the test items will help determine how well the test actually does measure what we want to measure, which instructional areas are neglected, and which areas receive too much stress. This information is useful in interpreting the test results and in developing supplementary evaluation devices.
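
The item-by-item tabulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item classifications and the local emphasis percentages below are hypothetical, and the helper names (`coverage_by_area`, `compare`) are invented for this sketch rather than taken from the text.

```python
from collections import Counter

# Hypothetical item-by-item analysis of a standardized test: each entry is
# (subject-matter area, behavioral change) as judged for one test item.
item_analysis = [
    ("Plants", "Understands"), ("Plants", "Understands"),
    ("Animals", "Applies"), ("Weather", "Understands"),
    ("Weather", "Applies"), ("Weather", "Applies"),
    ("Earth", "Understands"), ("Sky", "Applies"),
    ("Sky", "Applies"), ("Sky", "Understands"),
]

# Hypothetical emphasis given to each area in the local program, in per cent.
local_emphasis = {"Plants": 15, "Animals": 15, "Weather": 30, "Earth": 15, "Sky": 25}


def coverage_by_area(items):
    """Return the percentage of test items that fall in each subject-matter area."""
    counts = Counter(area for area, _ in items)
    total = len(items)
    return {area: 100 * n / total for area, n in counts.items()}


def compare(test_pct, local_pct):
    """Return test percentage minus local emphasis for every local area."""
    return {area: test_pct.get(area, 0.0) - pct for area, pct in local_pct.items()}


test_pct = coverage_by_area(item_analysis)
differences = compare(test_pct, local_emphasis)

for area, diff in differences.items():
    # Positive difference: the test stresses the area more than instruction did.
    label = "overstressed" if diff > 0 else "neglected" if diff < 0 else "matched"
    print(f"{area:8s} test {test_pct.get(area, 0.0):5.1f}%  local {local_emphasis[area]:3d}%  {label}")
```

The same tally could be made for the behavioral changes by counting the second element of each pair. The labels "overstressed" and "neglected" are rough flags only; judging whether a difference matters remains a matter of the teacher's judgment.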


EVALUATION ON A BROADER SCALE

The major theme running throughout this book is that evaluation is an integral part of the teaching-learning process and that it involves two basic steps: (1) identifying and defining the objectives of instruction, and (2) constructing or selecting evaluation instruments which best appraise these objectives. Thus, the primary emphasis is on the extent to which the specified learning outcomes for a particular course or curriculum have been achieved. In a recent article concerning evaluation and course improvement, Cronbach has pointed out that there are times when it may be desirable to evaluate outcomes beyond those which have been set for a given course or curriculum. Note these provocative comments.

In course evaluation, we need not be much concerned about making measuring instruments fit the curriculum. However startling this declaration may seem, and however contrary to the principles of evaluation for other purposes, this must be our position if we want to know what changes a course produces in the pupil. An ideal evaluation would include measures of all the types of proficiency that might reasonably be desired in the area in question, not just the selected outcomes to which this curriculum directs substantial attention. If you wish only to know how well a curriculum is achieving its objectives, you fit the test to the curriculum; but if you wish [...] national interest, you [...] new mathematics course [...] attempt to teach numerical trigonometry, and indeed, might discard [...] how well graduates [...] come through the new course are fairly proficient in computation despite the lack of direct teaching, the doubters will be reassured. [...] the evidence makes clear how much is being sacrificed.

L. J. Cronbach, "Course Improvement Through Evaluation," Teachers College Record, 64, 680, 1963.


Although these comments are directly concerned with the evaluation of large-scale curriculum improvement projects, the basic idea is generally applicable. For some purposes, it may be appropriate to determine pupil progress toward objectives other than those specified for a course or curriculum. Where this is done, it is, of course, necessary to relate the evaluation instruments as closely as possible to all of the outcomes to be measured, not just those which have been identified as proper instructional goals.


SUMMARY

If instructional objectives are to function most effectively in evaluation, a conscious effort must be made to relate the evaluation procedures to the specific learning outcomes encompassed in each objective. This can be facilitated by (1) a general evaluation plan, (2) a table of specifications, and (3) a selection of evaluation techniques which measure each learning outcome most directly.

A general evaluation plan consists of a list of all objectives and specific learning outcomes, with an indication of the type of evaluation technique to be used for each learning outcome. For teaching purposes, the methods to be used in achieving the objectives may also be included. The development of a general evaluation plan assures that provision has been made for evaluating all of the objectives and it alerts the teacher to types of evaluative information that must be gathered throughout the semester.

A table of specifications is especially useful in planning the classroom test. This is a twofold table which relates the objectives of the course to the subject-matter content used to achieve the objectives. It guides the teacher in constructing a test which measures a representative sample of the objectives and the course content which have been identified as measurable by a paper-and-pencil test. The table of specifications may be expanded into a more general evaluation plan [...] all objectives of the course.

The final step in relating evaluation procedures to educational objectives is in the selection of the specific evaluation technique to be used. In the case of test items and nontesting devices, a special effort must be made to obtain samples of pupil behavior which are similar to the behavior described in the specific learning outcomes. Only then can we be fairly certain that we are evaluating pupil progress toward the objectives we have identified as our instructional goals.

In some instances, we might be interested in determining the extent to which a course or curriculum is modifying pupil behavior in areas other than those toward which our teaching is directed. This requires evaluation procedures which go beyond the desired goals of instruction, but the basic principle of relating the evaluation instruments to the outcomes to be measured is still [...].


SUGGESTIONS FOR FURTHER READING

[...] "[...] Assessment of Academic Achievement [...]," Teachers College Record, 61, [...].

Ferris, F. L. "Testing in the New Curriculums: Numerology, 'Tyranny,' or Common Sense?" The School Review, 70, 112-131, 1962.

Findley, W. G. (ed.) The Impact and Improvement of School Testing Programs, Sixty-second Yearbook of the National Society for the Study of Education, Part II. Chicago: The University of Chicago Press, 1963. Chapter 2: Ebel, R. L., "The Relation of Testing Programs to Educational Goals."

Furst, E. J. Constructing Evaluation Instruments. New York: Longmans, Green, 1958. Chapter 7: "Planning the Test." Chapter 8: "Constructing Items to Fit Specifications."

References Illustrating How Evaluation Procedures
Are Related to Objectives

Bloom, B. S. (ed.) Taxonomy of Educational Objectives: Handbook I, Cognitive Domain. New York: Longmans, Green, 1956.

Gerberich, J. R. Specimen Objective Test Items: A Guide to Achievement Test Construction. New York: Longmans, Green, 1956.

Henry, N. B. (ed.) The Measurement of Understanding, Forty-fifth Yearbook of the National Society for the Study of Education, Part I. Chicago: The University of Chicago Press, 1946.

Krathwohl, D. R., B. S. Bloom, and B. B. Masia. A Taxonomy of Educational Objectives: Handbook II, The Affective Domain. New York: David McKay, 1964.

Morse, H. T., and G. H. McCune. Selected Items for the Testing of Study Skills and Critical Thinking. National Council for the Social Studies. Washington, D.C.: National Education Association, 1957.


In selecting or constructing an evaluation instrument the most important question is: To what extent will the results serve the particular uses for which they are intended? This is the essence of validity.

Many aspects of pupil behavior are evaluated in the school, and the results serve a variety of uses. For example, achievement may be evaluated in order to diagnose learning difficulties or to determine progress toward instructional objectives; scholastic aptitude may be measured in order to predict success in future learning activities or to group pupils for instructional purposes; and appraisals of personal-social development may be obtained in order to better understand pupils or to screen them for referral to a guidance counselor. Regardless of the area of behavior being evaluated, however, or the use to be made of the results, all of the various procedures used in an evaluation program should possess certain common characteristics. The most essential of these characteristics can be classified under the headings of validity, reliability, [...].

Validity refers to the extent to which the results of an evaluation procedure serve the particular uses for which they are intended. If the results are to be used to describe pupil achievement, we should like them to represent the specific achievement we wish to describe, to represent all aspects of the achievement we wish to describe, and to represent nothing else. Our desires in this regard are similar to those of the defense attorney in the courtroom who wants the truth, the whole truth, and nothing but the truth. If the results are to be used to predict pupil success in some future activity, we should like them to provide as truthful an indication of future success as possible. Basically, then, validity is always concerned with the specific use to be made of the results and with the truthfulness of our proposed interpretations.




Reliability refers to the consistency of evaluation results. If we obtain quite similar scores when the same test is administered to the same group on two different occasions, we can conclude that our results have a high degree of reliability from one occasion to another. Similarly, if different teachers independently rate the same pupils on the same instrument and obtain similar ratings, we can conclude that the results have a high degree of reliability from one rater to another. As with validity, reliability is intimately related to the type of interpretation to be made. For some uses, we may be interested in asking how reliable our evaluation results are over a given period of time, and for others, how reliable they are over different samples of the same behavior. In all instances in which reliability is being determined, however, we are concerned with the consistency of the results, rather than with the extent to which they serve the specific use under consideration.

Although reliability is a highly desired quality, it should be noted that reliability provides no assurance that evaluation results will yield the desired information. As with a witness testifying in a courtroom trial, the fact that he consistently tells the same story does not guarantee that he is telling the truth. The truthfulness of his statements can be determined only by comparing them with some other evidence. Similarly, with evaluation results, consistency is an important quality but only if it is accompanied by truthfulness, [...] independently. Little is accomplished [...] the wrong information.


NATURE OF VALIDITY

When using the term validity, in relation to testing and evaluation, there are a number of cautions to be borne in mind.

1. [...] to speak of the validity of [...], or more specifically, of the validity of the interpretation to be made from the results.

2. Validity does not exist on an all-or-none basis. Consequently, we should avoid thinking of evaluation results as valid or invalid. Validity is best considered in terms of categories that specify degree, such as high validity, moderate validity, and low validity.

3. Validity is always specific to some particular use. It should never be considered a general quality. For example, the results of an arithmetic test may have a high degree of validity for indicating computational skill, a low degree of validity for indicating arithmetical reasoning, a moderate degree of validity for predicting success in future mathematics courses, and no validity for predicting success in art or music. Thus, when appraising or describing validity, it is necessary to consider the use to be made of the results. Evaluation results are never just valid; they have a different degree of validity for each particular use to which they are put.

The Validity of Evaluation Results 61

TYPES OF VALIDITY

Four types of validity have been identified and are now commonly used in educational and psychological measurement.¹ They are: content validity, predictive validity, concurrent validity, and construct validity. The general meaning of these types of validity is indicated in Table 4.1. Each type will be explained more fully in the remainder of this chapter. For the sake of clarity, the discussion will be limited to validity as it relates to testing procedures. It should be recognized, however, that these four types of validity are also applicable to the various kinds of evaluation instruments used in the school.

Table 4.1


FOUR TYPES OF VALIDITY

Type                   Meaning                              Procedure

Content Validity       How well the test measures the       Compare test content to the universe
                       subject-matter content and           of content and behaviors to be
                       behaviors under consideration        measured

Predictive Validity    How well test performance            Compare test scores with another
(Criterion-related)    predicts some future performance     measure of performance obtained at
                                                            a later date

Concurrent Validity    How well test performance            Compare test scores with another
(Criterion-related)    compares with some other current     measure of performance obtained at
                       performance                          approximately the same time

Construct Validity     How test performance can be          Experimentally determine what
                       described psychologically            factors influence scores on the test

¹ American Educational Research Association and National Council on Measurements Used in Education, Technical Recommendations for Achievement Tests (Washington: National Education Association, 1955). American Psychological Association, "Technical Recommendations for Psychological Tests and Diagnostic Techniques," Supplement to the Psychological Bulletin, 51, 1954.




Content Validity

The content of a course or curriculum may be broadly defined to include both subject-matter content and instructional objectives. The former is concerned with the topics, or subject-matter areas, to be covered, and the latter with the behavioral changes sought in pupils. Both of these aspects of content are of concern in determining content validity. We should like any achievement test we construct, or select, to provide results which are representative of the topics and behaviors we wish to measure. This is the essence of content validity. More formally, content validity may be defined as the extent to which a test measures a representative sample of the subject-matter content and the behavioral changes under consideration.

As might be expected, content validity is of primary concern in achievement testing. The procedures used are those of logical analysis and comparison. The test is examined to determine the subject-matter content covered and the responses pupils are intended to make to the content, and this is compared with the domain of achievement to be measured. Although this is sometimes done in a rather haphazard manner, greater assurance of content validity is obtained if the following steps are followed.

1. The major topics of subject-matter content and the major types of behavioral changes to be measured by the test are separately listed. These lists are usually derived from the topical content and the objectives emphasized in the instructional program. If the test is to measure achievement in a specific course, the individual teacher involved might develop the lists. If the test is to be used on a school-wide basis, the preparation of the lists might best be handled by a committee of teachers.

2. The various subject-matter topics and types of behavioral changes are weighted in terms of their relative importance. There is no simple procedure for determining appropriate relative weights for the various topics and behaviors. It depends on personal judgment as guided by the amount of time devoted to each area during instruction, the philosophy of the school, the opinion of experts in the area, and similar criteria.

3. A table of specifications, like the ones presented in Chapter 3, is built from the weighted lists of subject-matter topics and expected behavioral changes. This table, then, specifies the relative emphasis the test should give to each subject-matter topic and each type of behavioral change.

4. The achievement test is constructed, or selected, in accordance with the table of specifications. The closer the test corresponds to the specifications indicated in the table, the greater the likelihood that the pupils' responses to the test will have a high degree of content validity.


A table of specifications is presented in Table 4.2 to illustrate how such a table can be used. The percentages in the table indicate the relative emphasis each subject-matter area and each type of behavioral change is to be given in the test. Thus, if the test is to measure a representative sample of subject-matter content, 15 per cent of the test items should be concerned with plants, 15 per cent with animals, 30 per cent with weather, 15 per cent with the earth, and 25 per cent with the sky. If the test is to measure a representative sample of behavioral changes, 50 per cent of the items should measure the "understanding of concepts," and 50 per cent should measure the "application of concepts." This, of course, implies that the specific emphasis on "understanding" and "application" for each subject-matter area will follow that indicated by the percentages in the table of specifications. For example, items concerned with plants that measure "understanding of concepts" should make up 10 per cent of the test, and items concerned with plants that measure "application of concepts" should make up 5 per cent of the test.


Table 4.2

TABLE SHOWING THE RELATIVE EMPHASIS TO BE GIVEN TO THE VARIOUS
SUBJECT-MATTER AREAS AND TO THE CHANGES IN BEHAVIOR
FOR A TEST IN ELEMENTARY SCHOOL SCIENCE

                  Changes in Behavior (in Percentage)

Subject-matter    Understands    Applies
Areas             Concepts       Concepts     Total
Plants            10             5            15
Animals           10             5            15
Weather           15             15           30
Earth             5              10           15
Sky               10             15           25
Total             50             50           100
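
As a rough illustration of how the percentages in Table 4.2 translate into a test plan, the sketch below converts each cell into a number of items. The test length of 40 items is an assumption made for this example (it is not given in the text), and the function names are hypothetical.

```python
# Percentages from Table 4.2 (rows: subject-matter areas; columns: behaviors).
TABLE_4_2 = {
    "Plants":  {"Understands Concepts": 10, "Applies Concepts": 5},
    "Animals": {"Understands Concepts": 10, "Applies Concepts": 5},
    "Weather": {"Understands Concepts": 15, "Applies Concepts": 15},
    "Earth":   {"Understands Concepts": 5,  "Applies Concepts": 10},
    "Sky":     {"Understands Concepts": 10, "Applies Concepts": 15},
}


def row_totals(table):
    """Percentage emphasis for each subject-matter area (the Total column)."""
    return {area: sum(row.values()) for area, row in table.items()}


def column_totals(table):
    """Percentage emphasis for each type of behavioral change (the Total row)."""
    totals = {}
    for row in table.values():
        for behavior, pct in row.items():
            totals[behavior] = totals.get(behavior, 0) + pct
    return totals


def items_per_cell(table, test_length):
    """Convert percentage emphasis into a number of test items for each cell."""
    return {
        area: {behavior: round(test_length * pct / 100) for behavior, pct in row.items()}
        for area, row in table.items()
    }


# Assumed test length; 40 is chosen so that every cell is a whole number.
plan = items_per_cell(TABLE_4_2, 40)
total_items = sum(sum(row.values()) for row in plan.values())

for area, row in plan.items():
    print(area, row)
print("Total items:", total_items)
```

With other test lengths the rounded cell counts may not add up exactly to the intended total, so a few cells would need to be adjusted by hand.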


It should be noted that this procedure merely provides a rough check on content validity. Such an analysis reveals the apparent relevance of the test items to the subject-matter areas and behavioral changes to be measured. Content validity is concerned with the extent to which the test items actually do call forth the responses represented in the table of specifications. Test items may appear to measure "understanding" but not function as intended because of defects in the items, unclear directions, inappropriate vocabulary, or poorly controlled testing conditions. Thus, content validity is dependent on a host of factors other than the apparent relevance of the test items. Most of what is written in this book concerning the construction and selection of achievement tests is directed toward improving the content validity of the obtained results.

Although our discussion of content validity has been limited to achievement testing, content validity is also of some concern in the measurement of aptitudes, interests, attitudes, and personal-social adjustment. For example, when selecting an interest inventory we should like it to cover those aspects of interest with which we are concerned. Similarly, an attitude scale should include those attitudinal topics that are in accord with the objectives we wish to measure. The procedure here is essentially the same as that in achievement testing. It is a matter of analyzing the test materials and the outcomes to be measured and judging the degree of correspondence between them.


Predictive Validity (Criterion-Related)

While content validity is most easily understood in conjunction with achievement testing, predictive validity can be most clearly illustrated in the area of aptitude testing. As indicated in the first chapter, an aptitude test is one that predicts success in some future activity. Thus, by its very nature, aptitude testing is dependent upon predictive validity.

Suppose that Mr. Young, a junior high school teacher, wants to determine how well scores from a certain scholastic aptitude test predict success in his seventh-grade arithmetic class. Since the scholastic aptitude test is administered to all pupils when they enter junior high school, these scores are readily available to Mr. Young. His biggest problem is deciding on a criterion of successful achievement in arithmetic. For lack of a better criterion, Mr. Young decides to use a comprehensive departmental examination that is administered to the various seventh-grade arithmetic sections at the end of the school year. It is now possible for Mr. Young to determine how well the scholastic aptitude test scores predict success in his arithmetic class by comparing the pupils' scholastic aptitude test scores with their scores on the departmental examination. Do those pupils who have high scholastic aptitude test scores also tend to have high scores on the departmental examination? Do those who have low scholastic aptitude test scores also tend to have low scores on the departmental examination? If this is the case, Mr. Young is inclined to agree that the scholastic aptitude test scores tend to be accurate in predicting achievement in his arithmetic class. In short, he recognizes that the test results possess predictive validity.

Predictive validity, then, may be defined as the extent to which test performance is accurate in predicting some future performance.

In our illustration above, Mr. Young merely inspected the scholastic aptitude test scores and the achievement test scores to determine the agreement between them. Although this may be a desirable preliminary step, it is seldom sufficient for indicating predictive validity. The usual procedure is to correlate statistically the two sets of scores and to report the degree of relationship between them by means of a correlation coefficient. This enables predictive validity to be presented in precise and universally understood terms. They are, of course, "universally understood" only by those who understand and can interpret correlation coefficients. This should pose no great problem, however, since the meaning of a correlation coefficient can be easily grasped by persons whose computational skill goes no further than that of simple arithmetic.
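
The kind of preliminary inspection Mr. Young made can be imitated with a very simple check. The sketch below is an illustration only, not a procedure given in the text: it takes the spring arithmetic scores from Table 4.3, listed in order of fall aptitude rank, and compares the average for the upper half of the group with the average for the lower half. The function name `split_half_means` is hypothetical.

```python
from statistics import mean

# Spring arithmetic scores from Table 4.3, listed in order of fall aptitude
# rank (first entry = highest aptitude score, last entry = lowest).
arithmetic_by_aptitude_rank = [77, 76, 72, 67, 82, 63, 60, 78, 69, 49,
                               48, 58, 56, 57, 74, 62, 46, 65, 59, 54]


def split_half_means(criterion_scores):
    """Return the mean criterion score for the upper and lower halves of the group."""
    half = len(criterion_scores) // 2
    return mean(criterion_scores[:half]), mean(criterion_scores[half:])


upper_mean, lower_mean = split_half_means(arithmetic_by_aptitude_rank)
print(f"Upper half on aptitude: mean arithmetic score {upper_mean:.1f}")
print(f"Lower half on aptitude: mean arithmetic score {lower_mean:.1f}")
```

A higher average for the upper half suggests that the two sets of scores agree in a general way, but like any inspection it gives no precise statement of the degree of relationship.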




Rank-Difference Correlation. To clarify the calculation and interpretation of correlation coefficients, let's consider the exact scores Mr. Young's pupils received on both the scholastic aptitude test and the departmental examination in arithmetic. This information is provided in the first two columns of Table 4.3. By inspecting these two columns of scores, as Mr. Young did, it is possible to note that high scores in Column 1 tend to go with high scores in Column 2. This comparison is difficult to make, however, since the size of the test scores in the two columns is different.


Table 4.3

TEST SCORES AND TEST-SCORE RANKS FOR TWENTY
JUNIOR HIGH SCHOOL PUPILS

              1          2            3          4            5            6
              Fall       Spring                               (D)          (D²)
              Aptitude   Arithmetic   Aptitude   Arithmetic   Difference   Difference
Pupil         Scores     Scores       Rank       Rank         in Rank      Squared
John          119        77           1          3            -2           4
Henry         118        76           2          4            -2           4
Mary          11?        72           3          6            -3           9
Susan         115        67           4          8            -4           16
[illegible]   112        82           5          1            4            16
[illegible]   109        63           6          10           -4           16
[illegible]   108        60           7          12           -5           25
Ralph         106        78           8          2            6            36
Jane          105        69           9          7            2            4
Karl          104        49           10         18           -8           64
Jim           102        48           11         19           -8           64
Frank         100        58           12         14           -2           4
Karen         98         56           13         16           -3           9
Joan          97         57           14         15           -1           1
Ruby          95         74           15         5            10           100
June          94         62           16         11           5            25
Helen         93         46           17         20           -3           9
George        91         65           18         9            9            81
Alice         90         59           19         13           6            36
Martin        89         54           20         17           3            9
                                                              ΣD² = 532


The agreement of the two sets of scores can be more easily judged if the test scores are converted to ranks. This has been done in Columns 3 and 4 of Table 4.3. Note that the pupil who was first on the aptitude test ranked third on the arithmetic test; the pupil who was second on the aptitude test ranked fourth on the arithmetic test; the pupil who was third on the aptitude test ranked sixth on the arithmetic test; and so on. Comparing the rank order of the pupils on the two tests, as indicated in Columns 3 and 4 of Table 4.3, gives us a fairly good picture of the relationship between the two sets of scores. From this inspection we know that pupils who had a high standing on the aptitude test also had a high standing on the arithmetic test, and pupils who had a low standing on the aptitude test also had a low standing on the arithmetic test. Our inspection of Columns 3 and 4 also shows us, however, that the relationship between the pupils' ranks on the two tests is not perfect. There is some shifting in rank order from one test to another. Our problem now is: How can we express the degree of relationship between these two sets of ranks in meaningful terms? This is where the correlation coefficient becomes useful.

The rank-difference correlation is simply a method of expressing the degree of relationship between two sets of ranks. The steps in determining a rank-difference correlation coefficient are presented in the following computing guide.² Mr. Young's data, in Table 4.3, are used to illustrate the procedure. It will be noted that the Greek letter rho (ρ) is used to identify a rank-difference correlation coefficient. From our computations for Mr. Young's data we find that ρ = .60. This correlation coefficient is a statistical summary of the degree of relationship between the two sets of scores in Mr. Young's data. In this particular instance, it indicates the extent to which the fall aptitude test scores (predictor) are predictive of the spring arithmetic test scores (criterion). In short, it refers to the predictive validity of the aptitude test scores.


COMPUTING GUIDE: RANK-DIFFERENCE CORRELATION


Steps                                                            Results in Table 4.3

1. Arrange pairs of scores, for each pupil, in columns.          Columns 1 and 2
2. Rank pupils from 1 to N (number in group) for each set of     Columns 3 and 4
   scores.
3. Find the difference (D) in ranks by subtracting the rank in   Column 5
   the right-hand column (Column 4) from the rank in the left-
   hand column (Column 3).
4. Square each difference in rank (Column 5) to obtain the       Column 6
   difference squared (D²).
5. Sum the squared differences in Column 6 to obtain ΣD².        Bottom of Column 6
6. Apply the following formula:

   \[ \rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]

   \[ \rho = 1 - \frac{6 \times 532}{20(20^2 - 1)} = 1 - \frac{3192}{7980} = 1 - .40 = .60 \]

   Σ = Sum of
   D = Difference in rank
   N = Number in group
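
The steps in the computing guide can be sketched in Python. This is a minimal illustration and not part of the original guide: the function names are hypothetical, ranking the highest score as 1 and giving tied scores the average of their ranks are assumptions, and only the sum of squared differences (532) and the group size (20) come from the worked example.

```python
def rank_scores(scores):
    """Rank scores from 1 (highest) to N; tied scores share the average rank.

    The tie rule is an assumption of this sketch, not a step in the guide.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def rho_from_sum(sum_d_squared, n):
    """Step 6: rho = 1 - 6 * (sum of D squared) / (N * (N squared - 1))."""
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


def rank_difference_correlation(x_scores, y_scores):
    """Steps 2 to 6 for two equally long lists of paired scores."""
    x_ranks = rank_scores(x_scores)
    y_ranks = rank_scores(y_scores)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(x_ranks, y_ranks))
    return rho_from_sum(sum_d_squared, len(x_scores))


# The worked example in the guide: sum of D squared = 532, N = 20.
print(round(rho_from_sum(532, 20), 2))
```

Calling `rank_difference_correlation` with two lists of raw scores carries out the ranking and differencing steps as well; `rho_from_sum` alone is enough when Column 6 has already been totaled by hand.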


² Correlation coefficients may also be determined by the product-moment technique,
which is easier to apply to large groups. See the computing guide in the appendix.

How good is a predictive validity coefficient of .60? Should Mr. Young be happy with
this finding or should he be disappointed? Does this particular aptitude test provide
a good prediction of future performance in arithmetic? Unfortunately, simple and
straightforward answers cannot be given to such questions. The interpretation of
correlation coefficients is dependent upon information from a variety of sources.
First, we know that the following correlation coefficients indicate the extreme degrees
of relationship that it is possible to obtain between variables.


1.00 = perfect positive relationship 
.00 = no relationship 
—1.00 = perfect negative relationship 


Since Mr. Young’s validity coefficient is .60, we know that the relationship is
positive but somewhat less than perfect. Obviously, the nearer a validity coefficient
approaches 1.00 the happier we are with it because larger validity coefficients
indicate greater accuracy in predicting from one variable to another.³


Another way of evaluating Mr. Young’s validity coefficient of .60 is to compare it to
the validity coefficients obtained with other methods of predicting performance in
arithmetic. If this validity coefficient is larger than those obtained with other
prediction procedures, Mr. Young will continue to use the scholastic aptitude test as
the best means available to him for predicting the arithmetic performance of his
pupils. Thus, validity coefficients are large or small only in relation to each other.
Where predictive validity is an important consideration, we shall always consider more
favorably the test with the highest predictive validity. In this regard, even aptitude
tests with rather low predictive validity may be useful if they are the best predictors
available and the predictions they provide are better than chance.⁴
Probably the easiest way of grasping the practical meaning of a correlation
coefficient is to note how the accuracy of prediction increases as the correlation
coefficient becomes larger. This is shown in the various charts presented in
Table 4.4. The rows in each chart represent the fourths of a group on some predictor
(such as a scholastic aptitude test) and the columns indicate the percentage of persons
falling in each fourth on the criterion measure (such as an achievement test). First
note that for a correlation coefficient of .00, being in the top quarter on the
predictor provides no basis for predicting where a person might fall on the criterion
measure. His chances of falling in each quarter are equally good. Now turn to the chart
for a correlation coefficient of .60. Note, here, that if a person falls in the top
quarter on the predictor, he has 54 chances out of 100 of falling in the top quarter on
the criterion measure, 28 chances out of 100 of falling in the second quarter, 14
chances out of 100 of falling in the third quarter, and only 4 chances out of 100 of
falling in the bottom quarter. The remainder of the chart is read in a similar manner.

By comparing the charts for the different-size correlation coefficients, it is
possible to get some feel for the meaning of a correlation coefficient in terms of
prediction efficiency. As the correlation coefficient becomes larger, a person’s
chances of being in the same quarter on the criterion measure as he is on the predictor
are increased. This can be seen by looking at the percentages in the diagonal cells.
With a correlation coefficient of 1.00, each diagonal cell would, of course, contain
100 per cent of the cases, indicating perfect prediction from one measure to another.

³ A coefficient of -1.00 would also give us perfect prediction from one variable to
another, but in educational measurement we are most commonly concerned with positive
relationships.

⁴ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).


Table 4.4


PREDICTION EFFICIENCY FOR DIFFERENT-SIZE CORRELATION COEFFICIENTS*

[Charts: in each chart the rows give the quarter on the predictor (1 to 4) and the
columns give the quarter on the criterion (4 to 1).]

* Adapted from tables in R. L. Thorndike and E. Hagen, Measurement and Evaluation in
Psychology and Education (New York: John Wiley & Sons, 1961), page 171. Numbers in
each cell were adjusted to nearest whole number to provide 100 cases in each row and
column.


measure can also be shown b 
the one shown in Fi 


manner. Thus, each tally mark in Figure 4.1 represents how well each of Mr. Young’s
twenty pupils performed on the fall and spring tests. The total number of pupils in
each cell, and in each column and row, have also been indicated.

The expectancy grid shown in Figure 4.1 can be used directly as an expectancy table,
simply by using the frequencies in each cell. The interpretation of such information is
simple and direct. For example, of those pupils who scored above average on the fall
aptitude test, none scored below 65 on the spring arithmetic test, 2 out of 5 scored
between 65 and 74, and 3 out of 5 scored between 75 and 84. Of those who scored below
average on the fall aptitude test, none scored in the top category on the spring
arithmetic test and 4 out of 5 scored below 65. These interpretations are limited to
the group tested but from such results one might make predictions concerning future
pupils. We can say, for example, that pupils who score above average on the fall
aptitude test will probably score above average on the spring arithmetic test. Other
predictions can be made in the same way by noting the frequencies in each cell of the
grid in Figure 4.1.


[Figure 4.1: grid with one row for each level of Fall Aptitude Scores: Above Average
(over 110), Average (95-110), Below Average (below 95).]

Figure 4.1. Expectancy grid showing how scores on the fall aptitude test and spring
arithmetic test are tallied in appropriate cells. (From data in Table 4.3.)


More commonly, the figures in an expectancy table are expressed in percentage. This is
readily obtained from the grid by converting each cell frequency to a percentage of the
total number of tallies in its row. This has been done for the data in Figure 4.1 and
the results are presented in Table 4.5. The first row of the table shows that of the 5
pupils who scored above average on the fall aptitude test, 40 per cent (2 pupils)
scored between 65 and 74 on the spring arithmetic test, and 60 per cent (3 pupils)
scored between 75 and 84. The remaining rows are read in a similar manner. The use of
percentage makes the figures in each row and column comparable. Our predictions can
then be made in standard terms (that is, chances out of 100) for all score levels. Our
interpretation is apt to be a little clearer if we say Henry’s chances of being in the
top group on the criterion measure are 60 out of 100 and Ralph’s are only 10 out of
100, than if we say Henry’s chances are 3 out of 5 and Ralph’s are 1 out of 10.


Table 4.5


EXPECTANCY TABLE SHOWING THE RELATION BETWEEN FALL
APTITUDE SCORES AND SPRING ARITHMETIC SCORES*


                               Percentage in Each Score Group on Spring
Fall Aptitude Scores                        Arithmetic Test
                               45-54      55-64      65-74      75-84

Above Average (over 110)                               40         60
Average (95-110)                 20         50         20         10
Below Average (below 95)         40         40         20


* From data in Figure 4.1.
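
The conversion from cell frequencies to row percentages can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the frequencies are reconstructed from the percentages in Table 4.5 and the row totals implied by the text (5, 10, and 5 pupils) rather than copied from the grid itself.

```python
SCORE_GROUPS = ["45-54", "55-64", "65-74", "75-84"]

# Cell frequencies per aptitude level (reconstructed, see the note above).
FREQUENCIES = {
    "Above Average": [0, 0, 2, 3],
    "Average": [2, 5, 2, 1],
    "Below Average": [2, 2, 1, 0],
}


def to_expectancy_table(frequencies):
    """Convert each cell frequency to a percentage of the tallies in its row."""
    table = {}
    for group, counts in frequencies.items():
        row_total = sum(counts)
        if row_total == 0:
            # An empty row has no basis for a prediction.
            table[group] = [0 for _ in counts]
        else:
            table[group] = [round(100 * count / row_total) for count in counts]
    return table


if __name__ == "__main__":
    for group, percentages in to_expectancy_table(FREQUENCIES).items():
        print(group, dict(zip(SCORE_GROUPS, percentages)))
```

Because every row is scaled to 100, rows based on different numbers of pupils can be read in the same terms, which is the point made in the text about chances out of 100.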


Expectancy tables take many different forms and may be used to show the relation
between various types of measures. The number of categories used with the predictor,
or criterion, may be as few as two or as many as desired. Also, the predictor may be
any set of measures for which we wish to establish predictive validity and the
criterion may be course grades, ratings, test scores, or whatever other measure of
success is relevant.⁵
When interpreting expectancy tables based on a small number of cases, like Mr. Young’s
class of twenty pupils, our predictions should be regarded as highly tentative. Each
percentage is based on so few pupils that we can expect large fluctuations in these
figures from one group of pupils to another. It is frequently possible to increase the
number of pupils represented in the table by combining test results from several
classes. Where this is done, our percentages are, of course, much more stable and our
predictions can be made with greater confidence. In any event, expectancy tables
provide a simple and direct means of indicating the predictive validity of test
results.


Concurrent Validity (Criterion-Related)


Most of what has been said about predictive validity is also applicable to concurrent
validity. The results are reported in terms of a validity coefficient, and such
coefficients are interpreted in the same way that other validity coefficients are
interpreted. The main difference between predictive and concurrent validity is in the
time span between the administration of the test and the obtaining of evidence on the
criterion measure. With predictive validity, evidence on the criterion measure is
obtained some time after the test results are obtained and the test results are used to
predict that later performance. With concurrent validity, evidence on the criterion
measure and the test results are obtained at approximately the same time. Thus, the
test results are related to present performance on the criterion.

Concurrent validity is usually determined where a test is being considered as 
a replacement for a much more time-consuming method of obtaining informa- 
tion. For example, Mr. Brown, the biology teacher, wondered if an objective 
test of study skills could be used in place of the elaborate observation and rating 
procedures he was presently using. He felt that if a test could be substituted for 
the more complex procedures, he would have much more time to devote to 
individual pupils during the supervised study period. An analysis of the specific 
pupil behaviors on which he rated the pupils’ study skills indicated that many 
of the procedures could be stated in the form of objective test questions. Con- 
sequently, he developed an objective test of study skills which he administered 
to his pupils. To determine how adequately his test measured study skills he 
correlated the test results with his ratings of the pupils’ study skills. A resulting 
correlation coefficient of .75 indicated considerable agreement between the test 
results and the criterion measure. This correlation coefficient represents the
concurrent validity of Mr. Brown’s test.
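
A validity coefficient of this kind is obtained by correlating two sets of measures for the same pupils. The Python sketch below uses the product-moment formula as one way of doing so; the text does not say which technique Mr. Brown used, and the function name, the test scores, and the ratings are all hypothetical.

```python
import math


def product_moment_correlation(x, y):
    """Product-moment correlation between two equally long lists of measures."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when one measure does not vary")
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data: objective test scores and teacher ratings for six pupils.
test_scores = [42, 35, 38, 29, 31, 25]
ratings = [5, 4, 4, 2, 3, 1]
print(round(product_moment_correlation(test_scores, ratings), 2))
```

The same function applies to predictive validity; the only difference is that the criterion measures would be collected some time after the test scores.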
Concurrent validity may be defined as the extent to which test performance is related
to some other current performance. In the case of Mr. Brown, test results were related
to ratings of performance. In other instances the test results may be compared to
teachers’ grades, other test results, and the like. Regardless of the criterion of
success used, if validity is being determined by correlating test results with some
other currently available information, concurrent validity is being obtained.

The “Criterion” Problem. In the determination of both concurrent validity and
predictive validity some criterion of success is necessary. It will be recalled that
Mr. Young used a comprehensive departmental examination as the criterion
of success in his seventh-grade arithmetic class. Mr. Brown used his own ratings 
of the pupils’ study skills. In each instance the criterion of success was only 
partially suitable as a basis for test validation. Mr. Young recognized that the 
departmental examination did not measure all of the important learning out- 
comes that he aimed at in teaching arithmetic. There was not nearly enough 
emphasis on arithmetic reasoning, the interpretation of graphs and charts was 
sadly neglected and, of course, the test did not evaluate the pupils’ attitudes 
toward arithmetic (which Mr. Young considered to be extremely important). 
Likewise, Mr. Brown was well aware of the shortcomings of his ratings of 
pupils’ study skills. He sensed that some pupils “put on a show” when they knew 
they were being observed. In other instances he felt that some of the pupils 
were probably overrated on study skills because of their high achievement in 
class work. Despite these recognized shortcomings, Mr. Young and Mr. Brown found it
necessary to use these criterion measures because they were the best criterion
measures available.

The plights of Mr. Young and Mr. Brown in locating a suitable criterion of success for
the purpose of test validation are not unusual. The selection of a satisfactory
criterion is one of the most difficult problems in validating a test. For most
educational purposes, no adequate criterion of success exists. Those which are used
tend to be lacking in comprehensiveness and in most cases provide results that are
less stable than those of the test being validated.

The lack of a suitable criterion for validating achievement tests has important
implications for the classroom teacher. Since statistical types of validity will
usually not be available, teachers will have to depend on procedures of logical
analysis to assure test validity. This means carefully identifying the objectives of
instruction, stating these objectives in terms of specific changes in pupil behavior,
and constructing or selecting evaluation instruments which satisfactorily measure the
behavioral changes sought in pupils. Thus, content validity will assume a role of
major importance in the teacher’s evaluation of pupil progress.

Construct Validity 


The three types of validity thus far described are all concerned with some specific
practical use of test results. They help us determine how well test scores represent
the achievement of certain learning outcomes (content validity), how well they predict
a certain future performance (predictive validity), or how well they estimate a
certain current performance (concurrent validity). In addition to these more specific
and immediately practical uses, we may wish to interpret test scores in terms of some
general psychological quality. For instance, rather than speak about a pupil’s score
on a particular arithmetic test, or how well it predicts success in mathematics, we
might want to infer that the pupil possesses a certain degree of reasoning ability.
This provides a broad general description of pupil behavior which has implications for
many different uses.

Whenever we wish to interpret test performance in terms of some psychological trait or
quality, we are concerned with construct validity. A construct is a psychological
quality which we assume exists in order to explain some aspect of behavior. Reasoning
ability is a construct. When we interpret test scores as measures of reasoning ability,
we are implying that there is a quality that can be properly called reasoning ability
and that it can account to some degree for performance on the test. Verifying such
implications is the task of construct validation.

Common examples of constructs are intelligence, scientific attitude, critical
thinking, reading comprehension, study skills, and mathematical aptitude. There is an
obvious advantage in being able to interpret test performance in terms of such
psychological constructs. Each construct has an underlying theory which can be brought
to bear in describing and predicting a person’s behavior. If we say a person is highly
intelligent, for example, we know what behaviors might be expected of him in various
situations.

Construct validity may be defined as the extent to which test performance can be
interpreted in terms of certain psychological constructs. The process of determining
construct validity involves the following steps: (1) identifying the constructs which
might account for test performance; (2) deriving hypotheses regarding test performance
from the theory underlying the construct; (3) verifying the hypotheses by logical and
empirical means. For example, let us suppose that we wish to check the claim that a
newly constructed test measures intelligence. From what is known about “intelligence,”
we might make the following predictions:


1. The test scores will increase with age (intelligence is assumed to increase
with age until approximately age 16).
2. The test scores will predict success in school achievement.
3. The test scores will be positively related to teachers’ ratings of intelligence.
4. The test scores will be positively related to scores on other “so-called
intelligence tests.”
5. The test scores will discriminate between groups which are known to differ,
such as “gifted” and “mentally handicapped.”
6. The test scores will be little influenced by direct teaching.


Each of these predictions, and others, would then be tested, one by one. If positive
results are obtained for each prediction, the combined evidence lends support to the
claim that the test measures intelligence. If a prediction is not confirmed, say the
scores do not increase with age, we must conclude that either the test is not a valid
measure of intelligence, or there is something wrong with our theory. As Cronbach and
Meehl⁶ have indicated, with construct validation both the theory and the test are
being validated at the same time.

⁶ L. J. Cronbach and P. E. Meehl, “Construct Validity in Psychological Tests,”
Psychological Bulletin, 52, 281-302, 1955.

As noted in the above illustration, there is no adequate single method of establishing
construct validity. It is a matter of accumulating evidence from many different
sources. We may use content validity, predictive validity, and concurrent validity as
partial evidence to support construct validity, but none of them alone is sufficient.
Construct validation depends on logical inferences drawn from a variety of types of
indirect evidence.

In examining construct validity, our interest is not limited to the psychological
construct the test was designed to measure. Any factor which might influence the test
scores is of legitimate concern. For example, although a test author may claim that
his test measures arithmetic reasoning, we might rightfully ask to what extent the
test scores are influenced by computational skill, reading ability, speed, and similar
factors. Broadly conceived, construct validity is an attempt to account for the
differences in test scores. Instead of asking, “Does this test measure what the author
claims it measures?” we are asking, “Precisely what does this test measure? How can we
most meaningfully interpret the scores in psychological terms?” The aim of construct
validation is to identify the nature and strength of all factors influencing
performance on the test.⁷

⁷ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).

Construct validity is of importance in all types of testing: achievement, aptitude,
and personal-social development. When selecting any standardized test, we should note
what interpretations are suggested for the test and then review the test manual to
determine the total available evidence supporting these interpretations. The
confidence with which we can make the proposed interpretations is directly dependent
on the type of evidence presented. Also, if we suspect that test scores are influenced
by factors other than those described in the manual (such as speed and reading
ability), we should check these hunches with a suitable experiment of our own.


FACTORS INFLUENCING VALIDITY 


Numerous factors tend to make test results invalid for their intended use. 
Some are rather obvious and easily avoided. No teacher would think of measur- 
ing knowledge of social studies with an English test. Nor would a teacher con- 
sider measuring problem-solving skill in third-grade arithmetic with a test 
designed for sixth graders. In both instances the test results would be obviously 
invalid. The factors influencing validity are of this same general nature but 
much more subtle in character. For example, a teacher may overload a social 
studies test with items concerning historical facts and thus it is less valid as a 
measure of achievement in social studies. Or a third-grade teacher may select 
appropriate arithmetic problems for his pupils but write directions that only 
the better readers are able to understand clearly. The arithmetic test then be- 
comes a reading test which invalidates the results for their intended use. This 
is the nature of some of the more subtle factors influencing validity. These are 
the factors for which the teacher should be alert, whether constructing classroom 
tests or selecting standardized tests. 


Factors in the Test Itself 


A careful examination of test items will indicate whether the test seems to 
measure the subject-matter content and the mental functions that the teacher is 
interested in testing. However, any of the following factors can prevent the test
items from functioning as intended and thereby lower the validity of the test results.


1. Unclear directions. Directions which do not clearly indicate to the pupil
how to respond to the items, whether it is permissible to guess, and how to record
the answers will tend to reduce validity.

2. Reading vocabulary and sentence structure too difficult. Vocabulary and
sentence structure which is too complicated for the pupils taking the test will
result in the test measuring reading comprehension and aspects of intelligence
rather than the aspects of pupil behavior that the test is intended to measure.

3. Inappropriate level of difficulty of the test items. Items which are too easy
or too difficult will not provide reliable discriminations among pupils and will
therefore lower validity.

4. Poorly constructed test items. Test items which unintentionally provide
clues to the answer will tend to measure the pupils’ alertness in detecting clues
as well as the aspects of pupil behavior that the test is intended to measure.

5. Test items which are inappropriate for the outcome being measured. Attempting
to measure understandings, thinking skills, and more complex behaviors with test
forms which are appropriate only for measuring factual knowledge will invalidate
the results.

In short, any defect in the construction of the test which prevents the test items
from functioning in harmony with their intended use will contribute to the invalidity
of the measurement. Much of what is written in the following chapters is directed
toward helping teachers improve the validity of the results obtained with classroom
tests and other evaluation instruments.


Functioning Content and Teaching Procedures

In the case of achievement testing, the functioning content of test items cannot be
determined merely by examining the form and content of the test. For example, the
following item may appear to measure arithmetical reasoning if examined without
reference to what the pupils have been taught.


If a 40’ pipe is cut so that the shortest piece is 7 as long as the longest 


piece, what is the length of the shortest piece? 
However, if the teacher has taught the solution to this particular problem before
giving the test, the test item now measures no more than memorized knowledge.
Similarly, tests of understanding, critical thinking, and other complex learning
outcomes are valid measures in these areas only if the test items function as
intended. If the pupils have previously been taught the solutions to the particular
problems included in the test, or have been taught mechanical steps for obtaining the
solutions, such tests can no longer be considered valid instruments for measuring the
more complex mental processes.


Factors in Pupils’ Responses


In some instances, invalid test results are due to personal factors influencing the
pupil’s response to the test situation rather than to any shortcomings in the test
instrument. Pupils may be hampered by emotional disturbances which interfere with
their test performance. Some pupils are frightened by the test situation and thereby
unable to respond normally. Still others are not motivated to put forth their best
effort. These and other factors which restrict and modify pupils’ responses in the
test situation will obviously lower the validity of the test results.

Another factor which influences test results is that of response set.⁸ A response set
is a consistent tendency to follow a certain pattern in responding to test items. For
example, some persons will respond “true” when they do not know the answer to a
true-false item, while other persons will tend to mark “false.” A test with a large
number of true statements will consequently be to the advantage of the first type of
person and to the disadvantage of the second type. Although some response sets, such
as the one illustrated, can be offset by careful test construction procedures (e.g.,
including an equal number of true and false statements in the test), other response
sets are more difficult to control. Typical of response sets in this latter category
are the tendency to work for speed rather than accuracy, the tendency to gamble when
in doubt, and the use of a particular style in responding to essay tests. These
response sets reduce the validity of the test results by introducing into the test
score factors which are not pertinent to the purpose of the measurement.⁹

⁸ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).
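
The effect of the true-false response set comes down to simple arithmetic, sketched here in Python. The function name and the numbers are hypothetical: two pupils each meet 10 items they cannot answer, one always marking "true" and the other always marking "false" when unsure.

```python
def credit_from_response_set(unknown_true, unknown_false, habit):
    """Number of unknown items answered correctly through habit alone.

    unknown_true / unknown_false: unknown items keyed true / keyed false.
    habit: "true" or "false", the answer the pupil always marks when unsure.
    """
    if habit == "true":
        return unknown_true
    if habit == "false":
        return unknown_false
    raise ValueError('habit must be "true" or "false"')


# Unbalanced key: 8 of the 10 unknown items are true statements.
print(credit_from_response_set(8, 2, "true"), credit_from_response_set(8, 2, "false"))

# Balanced key: 5 true and 5 false, so neither habit is favored.
print(credit_from_response_set(5, 5, "true"), credit_from_response_set(5, 5, "false"))
```

With the unbalanced key the two pupils differ by six points although neither knows any of the ten answers; with the balanced key the difference disappears, which is why an equal number of true and false statements offsets this particular response set.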


Nature of the Group and the Criterion 


Validity is always specific to a particular group. An arithmetic test based on 
story problems, for example, may measure reasoning ability in a slow group,
and a combination of simple recall of information and computational skill in
a more advanced group. Similarly, scores on a science test may be accounted for 
largely by reading comprehension in one group and by knowledge of facts in 
another. What a test measures is influenced by such factors as age, sex, ability 
level, educational background, and cultural background. Thus, in appraising 
reports of test validity included in test manuals, or other sources, it is important 
to note the nature of the validation group. How closely it compares in significant 
characteristics to the group of pupils we wish to test, determines how applicable 
the information is to our particular group. 

In evaluating validity coefficients, it is also necessary to consider the nature of
the criterion used. For example, scores on a mathematics aptitude test are likely to
provide a more accurate prediction of achievement in a physics course in which
quantitative problems are stressed than in one where they play only a minor role.
Likewise, we can expect scores on a critical thinking test to correlate more highly
with grades in social studies courses which emphasize critical thinking than in those
which depend largely on the memorization of factual information. Other things being
equal, the greater the similarity between the behaviors measured by the test and the
behaviors represented in the criterion, the higher the validity coefficient.

Since validity information varies with the nature of the group tested and with the
composition of the criterion measures used, published validation data should be
considered as highly tentative. Whenever possible, the validity of the test results
should be checked in the specific local situation.

This discussion of factors influencing the validity of test results should make clear
the pervasive and functional nature of the concept validity. In the final analysis the
validity of test results is based on the extent to which the behavior elicited in the
testing situation is a true representation of the behavior being evaluated. Thus,
anything in the construction or the administration of the test which causes the test
results to be unrepresentative of the characteristics of the person tested contributes
to lower validity. In a very real sense, then, it is the user of the test who must
make the final judgment concerning the validity of the test results. He is the only
one who knows how well the test fits his particular use, how well the testing
conditions were controlled, and how typical the responses were to the testing
situation.

⁹ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).


SUMMARY 


The most important quality to consider when selecting or constructing an evaluation
instrument is validity. This refers to the extent to which the evaluation results
serve the particular uses for which they are intended. In interpreting validity
information it is important to keep in mind that validity refers to the results rather
than to the instrument, that its presence is a matter of degree, and that it is always
specific to some particular use.

There are four basic types of validity. Content validity refers to the extent to which
a test measures a representative sample of the subject-matter content and the
behavioral changes under consideration. It is especially important in achievement
testing and is determined by logical analysis of test content. Predictive validity is
concerned with the extent to which test performance is accurate in predicting some
future performance. This type of validity can be reported by means of a correlation
coefficient called a validity coefficient or by means of an expectancy table. It is of
special significance in all types of aptitude testing, but is pertinent whenever test
results are used to make specific predictions. Concurrent validity indicates the
extent to which test performance is related to some other current performance. It is
also reported by means of a validity coefficient. Concurrent validity is usually
determined where a test is being considered as a substitute for a more time-consuming
procedure. Construct validity refers to the extent to which test performance can be
interpreted in terms of certain psychological constructs. The process of construct
validation involves identifying and clarifying the factors which influence test scores
so that test performance can be interpreted most meaningfully. This involves the
accumulation of evidence from a variety of different studies. All of the other types
of validity may be used as partial support for construct validity, but it is the
combined evidence from all sources that is important. The more complete the evidence,
the more confident we are concerning the psychological qualities measured by the test.

A number of factors tend to influence the validity of test results. Some of these
influences can be found in the test instrument itself, some in the relation of
teaching to testing, some in the atypical responses of pupils to the test situation,
and still others in the nature of the group tested and in the composition of the
criterion measures used. A major aim in the construction, selection, and use of tests,
and other evaluation instruments, is to control those factors which have an adverse
effect on validity and to interpret evaluation results in accordance with what
validity information is available.


SUGGESTIONS FOR FURTHER READING


Anastasi, Anne. Psychological Testing. 2nd edition, New York: Macmillan, 1961.
Chapter 6: “Methods of Determining Validity.” Chapter 7: “Utilization of Validity
Data.”
Bauernfeind, R. H. Building a School Testing Program. Boston: Houghton Mifflin.
Chapter 4: “The Concept of Correlation.” Chapter 5: “The Concept of Test Validity.”
Cronbach, L. J. Essentials of Psychological Testing. New York: Harper & Row, 1960.
Chapter 5: “Validity.”
Cronbach, L. J. “Validity,” Encyclopedia of Educational Research. 3rd edition, New
York: Macmillan, 1960. Pages 1551-1555.
Ebel, R. L. “Obtaining and Reporting Evidence on Content Validity,” Educational and
Psychological Measurement, 16, 269-282, 1956.
Lennon, R. T. “Assumptions Underlying the Use of Content Validity,” Educational and
Psychological Measurement, 16, 294-304, 1956.


Test Bulletins*


Wesman, A. G. Better Than Chance. Test Service Bulletin, No. 45, New York: The
Psychological Corporation, 1953.


* Also see Footnote 5 in this chapter.


Chapter 5
other characteristics desired in evaluation procedures


Next to validity, reliability is the most important characteristic of evaluation
results. . . . Reliability (1) provides the consistency which makes
validity possible, and (2) indicates how much confidence we can place
in our results. . . . The practicality of the evaluation procedures is, of
course, also of concern to the busy classroom teacher.


In Chapter 4 it was emphasized that validity is the most important consideration
in the selection and construction of evaluation procedures. First and foremost
we want evaluation results to serve the specific uses for which they are
intended. Next in importance is reliability, and following that is a host of practical
features which can best be classified under the heading of usability.


RELIABILITY 


Reliability refers to the consistency of measurement, that is, to how consistent
test scores or other evaluation results are from one measurement to another.
Suppose, for instance, that Miss Jones had just given an achievement test to
her pupils. How similar would the pupils’ scores have been had she tested them
yesterday, or tomorrow, or next week? How would the scores have varied had
she selected a different sample of equivalent items? If it was an essay test, how
much would the scores have been changed had a different teacher scored it?
These are the types of questions with which reliability is concerned. Test scores
merely provide a limited measure of behavior obtained at a particular time.
Unless the measurement can be shown to be reasonably consistent (that is, able
to be generalized) over different occasions or over different samples of the same
behavior, little confidence can be placed in the results.¹

On the other hand, we cannot expect test results to be perfectly consistent.
There are numerous factors other than the quality being measured which may
influence test scores. If a single test is administered to the same group twice in
close succession, some variation in scores can be expected due to temporary
fluctuations in memory, attention, effort, fatigue, emotional strain, guessing,
and similar factors. With a longer time period between tests, additional variation
in scores may be caused by intervening learning experiences, changes in
health, forgetting, and less comparable testing conditions. If we use a different
sample of items in the second test, still another factor is likely to influence the
results. Individuals may find one test easier than the other because it happens
to contain more items on specific topics with which they are familiar. Such
extraneous factors as these introduce a certain amount of error into all test
scores. Methods of determining reliability are essentially means of determining
how much error is present under different conditions. In general, the more
consistent our test results are from one measurement to another, the less error is
present and, consequently, the greater the reliability.

The meaning of reliability, as applied to testing and evaluation, can be further
clarified by noting the following general points:

1. Reliability refers to the results obtained with an evaluation instrument and
not to the instrument itself. Any particular instrument may have a number of
different reliabilities, depending on the group involved and the situation in
which it is used. Thus it is more appropriate to speak of the reliability of “the
test scores,” or of “the measurement,” than of “the test,” or “the instrument.”

2. A closely related point is that an estimate of reliability always refers to a
particular type of consistency. Test scores are not reliable in general. They
are reliable (or able to be generalized) over different periods of time, over
different samples of questions, over different raters, and the like. It is possible
for test scores to be consistent in one of these respects and not in another. The
appropriate type of consistency in a particular case is dictated by the use to be
made of the results. For example, if we wish to know what individuals will be
like at some future time, constancy of scores is highly important. On the other
hand, if we want to measure an individual’s shifts in anxiety from moment to
moment, we shall need a measure which lacks constancy over occasions in order
to obtain the information we desire. Thus, for different interpretations we need
different analyses of consistency. Treating reliability as a general characteristic
can only lead to erroneous interpretations.²

3. Reliability is a necessary but not a sufficient condition for validity. A test
which provides totally inconsistent results cannot possibly provide truthful
information about the behavior being measured. On the other hand, highly
consistent test results may be measuring the wrong thing or may be used in ways
that are inappropriate. Thus, low reliability can be expected to restrict the degree
of validity that is obtained, but high reliability provides no assurance that a
satisfactory degree of validity will be present. In short, reliability merely
provides the consistency which makes validity possible.

Although a highly reliable measure may have little or no validity, a measure
which has been shown to have a satisfactory degree of predictive validity must
of necessity possess sufficient reliability. Thus, where we are interested only in
predicting a specific criterion, reliability will be of little concern if predictive
validity is satisfactory.³

4. Unlike validity, reliability is strictly a statistical concept. Logical analysis
of a test will provide little evidence concerning the reliability of the scores. The
test must be administered, one or more times, to an appropriate group of persons
and the consistency of the results determined. This consistency may be expressed
in terms of shifts in the relative standing of persons in the group or in terms
of the amount of variation to be expected in a specific individual’s score.
Consistency of the first type is reported by means of a correlation coefficient called
a reliability coefficient. Consistency of the second type is reported by means of
the standard error of measurement. Both methods of expressing reliability are
widely used and should be understood by persons responsible for interpreting
test results.⁴

¹ R. L. Thorndike, “Reliability,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).

² L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).


Determining Reliability by Correlation Methods 


In determining reliability it would be desirable to obtain two sets of measures 
under identical conditions and then to compare the results. This procedure is 
impossible, of course, since the conditions under which evaluation data are 
obtained can never be identical. As a substitute for this ideal procedure several 
methods of estimating reliability have been introduced. The methods are similar 
in that all of them involve correlating two sets of data, obtained either from 
the same evaluation instrument or from equivalent forms of the same procedure. 
The correlation coefficient used to determine reliability is calculated and inter- 
preted in the same manner as that used in determining predictive and concur- 
rent validity. The only difference between a validity coefficient and a reliability 
coefficient is that the former is based on agreement with an outside criterion 
and the latter is based on agreement between two sets of results from the same
procedure.

³ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).

⁴ American Educational Research Association and National Council on Measurements
Used in Education, Technical Recommendations for Achievement Tests (Washington:
National Education Association, 1955). American Psychological Association, “Technical
Recommendations for Psychological Tests and Diagnostic Techniques.” Supplement to the
Psychological Bulletin, 51, 1954.


The chief methods of estimating reliability are shown in Table 5.1. Note that
different types of consistency are determined by the different methods: consistency
over a period of time, consistency over different forms of the instrument,
and consistency within the instrument itself. The reliability coefficient
resulting from each method is given a name which characterizes the type of con- 
sistency being investigated. Each of these methods of estimating reliability will 
be considered in further detail as we proceed. Although the methods will be dis- 
cussed mainly with reference to testing procedures, they are, of course, also 
applicable to other types of evaluation techniques. 


Table 5.1

METHODS OF ESTIMATING RELIABILITY

Method                    Type of Reliability           Procedure
                          Coefficient

Test-retest method        Coefficient of stability      Give the same test twice to the same
                                                        group with any time interval between
                                                        tests from several minutes to
                                                        several years

Equivalent-forms method   Coefficient of equivalence    Give two forms of the test to the same
                                                        group in close succession

(Test-retest with         Coefficient of stability      Give two forms of the test to the same
equivalent forms)         and equivalence               group with increased time interval
                                                        between forms

Split-half method         Coefficient of internal       Give test once. Score two equivalent
                          consistency                   halves of test (e.g., odd items and
                                                        even items); correct reliability
                                                        coefficient to fit whole test by
                                                        Spearman-Brown formula

Kuder-Richardson method   Coefficient of internal       Give test once; score total test and
                          consistency                   apply Kuder-Richardson formula

Test-Retest Method. To estimate reliability by means of the test-retest
method the same test is administered twice to the same group of pupils with a 
given time interval between the two administrations of the test. The resulting 
test scores are correlated and this correlation coefficient is called a coefficient 
of stability because it indicates how stable the test results are over the given 
period of time. If the results are highly stable, those pupils who are high on 
one administration of the test will tend to be high on the other administration 
of the test and the remaining pupils will tend to stay in their same relative 
positions on both administrations of the test. Such stability would be indicated
by a large correlation coefficient. It will be recalled from our previous discussion
of correlation coefficients that a perfect positive relationship is indicated by 1.00
and a zero relationship by .00. Coefficients of stability in the .80’s and .90’s are
commonly reported for standardized tests of aptitude and achievement over
occasions within the same year.
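
The computation behind a coefficient of stability can be sketched in a few lines of Python. The sketch below is only an illustration: the scores for six pupils are hypothetical, the helper name `pearson_r` is our own, and the product-moment correlation is assumed to be the correlation coefficient intended.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same six pupils on two administrations
# of the same test (illustrative data only).
first_testing = [52, 47, 41, 38, 35, 30]
second_testing = [50, 49, 40, 36, 37, 29]

coefficient_of_stability = pearson_r(first_testing, second_testing)
print(round(coefficient_of_stability, 2))
```

Because the pupils keep nearly the same relative positions on both testings, the resulting coefficient is large, which is what the discussion above leads us to expect.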




One important factor to keep in mind in interpreting coefficients of stability
is the time interval between tests. If this time interval is short, say a day or two,
the constancy of the results will be inflated by the fact that pupils will remember
some of their answers from the first test to the second. If the time interval is
long, say about a year, the results will be influenced not only by the instability
of the testing procedure but also by actual changes in the pupils over that period
of time. In general, the longer the time interval between test and retest the more
the results are influenced by changes in the pupil characteristic being measured,
and the lower the coefficient of stability.

Which time interval between tests is preferable will depend largely on
the use to be made of the results. If we are trying to predict from ninth-grade
test scores whether a boy is likely to succeed in college, stability over a period
of several years is quite important. If we are trying to predict whether he will
succeed in this year’s algebra course, stability over any period longer than a few
months is quite unimportant. Thus, for some decisions we are interested in
stability coefficients based on a long interval between test and retest and, for
others, stability coefficients based on a short interval may be sufficient. The
important thing is to seek evidence of stability which fits the particular
interpretation to be made.

Most teachers will not find it possible to compute test-retest reliability
coefficients for their own classroom tests. However, in choosing standardized tests
the stability of the scores serves as one important criterion. The test manual
should provide coefficients of stability, indicating the time interval between tests
and any unusual experiences the group members might have had between testings.
Other things being equal (such as validity), we shall favor the test whose
scores have been shown to possess the type of stability we need to make sound
decisions.

Information concerning the stability of test scores also has implications for the
use of test results from school records, and for the frequency with which retesting
is needed. We know, for example, that first-grade scholastic aptitude test scores
are fairly stable over occasions within the same year, but relatively unstable
over a period of several years. Thus, we can expect to use such results in
determining readiness for first-grade work, but should not rely on them for estimates
of learning ability in the later elementary grades. For this use, a second test
will need to be administered at the beginning of the later elementary period.
Similarly, when using any test score from permanent records, one should check
the date of testing and the stability data available to determine if the results
are still dependable. Where doubt exists and the decision is important, retesting
is in order.

Equivalent-Forms Method. Estimating reliability by means of the equivalent-forms
method involves the use of two different but equivalent forms of
the same test. The two forms of the test are administered to the same group of
pupils in close succession and the resulting test scores are correlated. This
correlation coefficient is called a coefficient of equivalence, and it indicates the
degree to which both forms of the test are measuring the same aspects of pupil
behavior.

It should be noted that the coefficient of equivalence tells us nothing about 
the stability of the pupil characteristic being measured. This coefficient primarily 
reflects the extent to which the test represents an adequate sample of the char- 
acteristic being measured. In achievement testing, for example, there are thou- 
sands of questions that might be asked in a particular test. However, due to 
time limits and other restricting factors, only a limited number of the possible 
test questions can be used. If the questions included in the test provide a
representative sample of the possible questions in the area, the test may be said to
provide a reliable measure of the content in that area. The easiest way to
estimate whether a test measures a representative sample of the content is to construct
two forms of the test and correlate the results. A high correlation indicates 
that both forms are measuring the same content and therefore are probably 
reliable samples of the general area of content being measured. 

The equivalent-forms method of estimating reliability does away with the 
troublesome problem of selecting a proper time interval between tests as is
necessary with the test-retest method. However, the need for two equivalent
forms of the test restricts its use almost entirely to standardized testing. Here
it is widely used, since most standardized tests have two or more forms available.
In fact, a teacher should look with suspicion upon any standardized test which
has two forms available and does not report a coefficient of equivalence for them.
The comparability of the results of the two forms cannot be assumed unless 
evidence concerning their equivalence is presented. 

The equivalent-forms method is sometimes used with a time interval between
the administration of the two forms of the test. Under these conditions, the
resulting reliability coefficient is called a coefficient of stability and equivalence.
This is the most rigorous test of reliability because it includes all possible sources 
of variation in the test scores. The stability of the testing procedures, the con- 
stancy of the pupil characteristic being measured, and the representativeness of 
the sample of tasks included in the test are all taken into account. Consequently, 
this is generally recommended as the soundest procedure for estimating the relia- 
bility of test scores. As with the ordinary test-retest method, the reliability coeffi- 
cient must be interpreted in light of the time interval between the two forms 
of the test. For longer time periods, we should ordinarily expect smaller relia- 
bility coefficients. 

Split-Half Method. The reliability of test scores can also be estimated from
a single administration of a single form of a test. The test is administered to a
group of pupils in the usual manner and then is divided in half for scoring
purposes. To split the test into halves which are most equivalent, the usual procedure
is to score the even-numbered items and the odd-numbered items separately.
This provides two scores for each pupil which, when correlated, provide a
coefficient of internal consistency. This coefficient indicates the degree to which
the two halves of the test are equivalent.
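
The odd-even scoring just described can be sketched in Python as follows. This is a minimal illustration rather than a prescribed routine: the right/wrong responses of six pupils to an eight-item test are hypothetical, and the helper names `pearson_r` and `half_scores` are our own.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def half_scores(item_scores):
    """Score one pupil's odd-numbered and even-numbered items separately."""
    odd = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd, even

# Hypothetical data: one row per pupil, one column per item (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
]

odd_scores, even_scores = zip(*(half_scores(row) for row in responses))
half_test_r = pearson_r(odd_scores, even_scores)
print(odd_scores, even_scores, round(half_test_r, 2))
```

The value printed last is the correlation between the two half-tests only; it has not yet been adjusted to the length of the full test.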


As noted, the above reliability coefficient is determined by correlating the
scores of two half-tests. To estimate the reliability of the scores based on the
full length test the Spearman-Brown formula is usually applied. This formula
is as follows:

\[
\text{Reliability on full test} = \frac{2 \times \text{Reliability on } \tfrac{1}{2} \text{ test}}{1 + \text{Reliability on } \tfrac{1}{2} \text{ test}}
\]

The simplicity of the formula can be seen in the following example where the
correlation coefficient between the two halves of a test is .60.

\[
\text{Reliability on full test} = \frac{2 \times .60}{1 + .60} = \frac{1.20}{1.60} = .75
\]

This correlation coefficient of .75, then, provides an estimate of the reliability of a
full test where the half-tests correlated .60.
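
For readers who prefer to let a machine do the arithmetic, the Spearman-Brown correction for a test split into two halves can be written as a one-line Python function. The function name `spearman_brown` is our own choice for this sketch.

```python
def spearman_brown(half_test_r):
    """Estimate full-test reliability from the correlation between two half-tests."""
    return 2 * half_test_r / (1 + half_test_r)

# The worked example from the text: half-tests correlating .60.
print(round(spearman_brown(0.60), 2))  # 0.75
```

Note that the corrected coefficient is always at least as large as the half-test correlation whenever that correlation lies between .00 and 1.00, since the full test is twice as long as either half.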
The split-half method is similar to the equivalent-forms method in that it
indicates the extent to which the sample of test items is a representative sample
of the content being measured. A high correlation between scores on the two
halves of a test denotes the equivalence of the two halves and consequently the
adequacy of the sampling. However, like the equivalent-forms method, it tells
nothing about changes in the individual from one time to another.
Kuder-Richardson Method. Another method of estimating the reliability
of test scores from a single administration of a single form of a test is by means
of formulas such as those developed by Kuder and Richardson.⁵ These formulas
also provide a coefficient of internal consistency but they do not require splitting
the test in half for scoring purposes. One of the formulas, called the Kuder-Richardson
Formula 20, is based on the proportion of persons passing each item
and the standard deviation⁶ of the total test scores. The computation is rather
cumbersome, unless information is already available concerning the proportion
passing each item, but the resulting coefficient is equal to the average of all
possible split-half coefficients for the group tested.
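
Formula 20 itself is not printed in this chapter. The Python sketch below assumes its customary form, K/(K − 1) × (1 − Σpq / s²), where K is the number of items, p is the proportion passing an item, q = 1 − p, and s² is the variance of the total scores computed over the whole group. The function name `kr20` and the response data are hypothetical.

```python
from statistics import pvariance

def kr20(responses):
    """Kuder-Richardson Formula 20 for rows of 1/0 (right/wrong) item scores."""
    n_pupils = len(responses)
    n_items = len(responses[0])
    # Proportion of pupils passing each item.
    p = [sum(row[i] for row in responses) / n_pupils for i in range(n_items)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)
    # Variance (squared standard deviation) of the total test scores.
    total_variance = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)

# Hypothetical data: one row per pupil, one column per item (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
]

print(round(kr20(responses), 2))
```

The need to tabulate the proportion passing every item is what makes the hand computation cumbersome; once the item data are recorded, the remaining steps are brief.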
A less accurate but simpler formula to compute is the Kuder-Richardson
Formula 21. This formula can be applied to the results of any test which has
been scored on the basis of the number of correct answers. A modified version of
the formula⁷ is:

\[
\text{Reliability Coefficient (KR21)} = \frac{K}{K - 1}\left(1 - \frac{M(K - M)}{K s^{2}}\right)
\]

where
K = the number of items in the test
M = the mean (arithmetic average) of the test scores
s = the standard deviation of the test scores

⁵ R. L. Thorndike, “Reliability,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).

⁶ Standard deviation is a measure of the spread of scores. See the appendix for method of
computing.

⁷ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).


This formula will yield approximately the same results as Kuder-Richardson 
Formula 20, but in some cases the reliability coefficient may be smaller.⁸ Its
chief advantage is the ease with which it can be applied. 
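
The ease of application can be seen in the following Python sketch of Formula 21, which needs only the number of items, the mean, and the standard deviation. The function name `kr21` and the figures in the example (a 50-item test with a mean of 35 and a standard deviation of 8) are hypothetical.

```python
def kr21(n_items, mean, sd):
    """Kuder-Richardson Formula 21 from K (items), M (mean), and s (standard deviation)."""
    return (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * sd ** 2)
    )

# Hypothetical 50-item test: mean score 35, standard deviation 8.
print(round(kr21(50, 35, 8), 2))
```

No item-by-item tabulation is required, which is why a teacher can apply the formula to any classroom test scored by counting correct answers.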

Kuder-Richardson estimates of reliability assume that the items in the test 
are homogeneous. That is, that each test item measures the same quality or 
characteristic as every other. Where this assumption is justified, the reliability 
coefficient will be similar to that provided by the split-half method. If homo- 
geneity is lacking, as in an achievement test which measures different types of 
learning outcomes, an underestimate of split-half reliability will result.⁹

The simplicity of applying the split-half method and the Kuder-Richardson 
method has led to their widespread use in estimating reliability. However, such 
internal consistency procedures have limitations which restrict their value. First, 
they are not appropriate for speeded tests. That is, for tests with time limits 
which prevent pupils from attempting every item. Where speed is a significant 
factor in the testing, the reliability coefficients will be inflated to an unknown 
degree. This poses no great problem in estimating the reliability of test scores 
from teacher-made tests, since these are usually power tests. In the case of 
standardized tests, however, time limits are seldom so liberal that all pupils 
complete the test. Thus, coefficients of internal consistency reported in test man- 
uals should be generally disregarded unless evidence is also presented that speed 
of work is a negligible factor. For speeded tests, reliability obtained by the test- 
retest or equivalent-forms methods should be sought. 

A second limitation of internal consistency procedures is that they do not 
indicate the constancy of pupil response from day to day. In this regard, they 
are similar to the equivalent-forms method without a time interval. Only test- 
retest procedures indicate the extent to which test results are generalizable over 
different periods of time. 

Comparing Correlation Methods. As noted in our previous discussion,
each of the methods for estimating reliability provides different information 
concerning the consistency of test results. A summary of this information is 
presented in Table 5.2. This table makes clear the fact that most methods are 
concerned with only one or two types of consistency sought in test results. The 
test-retest method, without a time interval, takes into account only the consistency 
of the testing procedures and short-term constancy of response. If a time interval 
is introduced between the tests, the constancy of the characteristics of the pupil 
from day to day is also included. However, neither of the test-retest procedures 
provides information concerning the consistency of results over different samples 
of items, since both sets of scores are based on the same test. 

The equivalent-forms method without a time interval, the split-half method,
and the Kuder-Richardson method all take into account the consistency of test- 
ing procedures and the consistency of results over different samples of items. 


⁸ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).

⁹ R. L. Thorndike, “Reliability,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).


Table 5.2

TYPE OF CONSISTENCY INDICATED BY EACH OF THE
METHODS FOR ESTIMATING RELIABILITY

                                               Type of Consistency

Method of                      Consistency     Constancy          Consistency Over
Estimating                     of Testing      of Pupil           Different Samples
Reliability                    Procedure       Characteristics    of Items

Test-retest (immediate)             X               *
Test-retest (time interval)         X               X
Equivalent-forms (immediate)        X                                    X
Equivalent-forms (time interval)    X               X                    X
Split-half                          X                                    X
Kuder-Richardson                    X                                    X

* Short-term constancy of response is reflected in immediate retest, but day-to-day stability
is not shown.


Only the equivalent-forms method with an intervening time interval between 
tests takes into account all three types of consistency. This is the reason the 
coefficient of stability and equivalence, obtained with this method, is generally
regarded as the most useful estimate of test reliability.


The Standard Error of Measurement 


If it were possible to test a pupil over and over again on the same test, we
would find that his scores would vary somewhat. The amount of variation in his
test scores would be directly related to the reliability of the testing procedures.
Low reliability would be indicated by large variations in the pupil’s test scores.
High reliability would be indicated by little variation from one testing to another.
Although it is impractical to administer a test many times to the same pupils,
it is possible to estimate the amount of variation to be expected in test scores.
This estimate is called the standard error of measurement.

Test manuals usually give the standard error of measurement. Thus all we
need to do is to take it into account when interpreting individual test scores.
For example, let us assume that we have just administered an intelligence test
to a class and the results indicate that Mary Smith has an IQ of 97. We note
in the test manual that the standard error of measurement is 5. What does this
mean with regard to Mary Smith’s IQ? In general, it indicates the amount of
error that must be taken into consideration in interpreting Mary Smith’s IQ
score. More specifically, it provides the limits within which we can reasonably
expect to find Mary Smith’s “true” IQ score. A “true” score is one that would
be obtained if the test were perfectly reliable. If Mary Smith were tested
repeatedly under identical conditions 68 per cent of her obtained scores would
fall within 1 standard error of her “true” score, 95 per cent would fall within
2 standard errors, and 99 per cent would fall within 3 standard errors.¹⁰ For
practical purposes, these limits may be applied to Mary Smith’s obtained score
of 97 to give us the following ranges within which we could be reasonably sure
to find her “true” score.


Number of          Level of       Score Units to Apply to     Range of
Standard Errors    Confidence     Mary's IQ Score of 97       Scores

      1               68%                   5                  92-102
      2               95%                  10                  87-107
      3               99%                  15                  82-112


On this basis, we can be nearly certain that Mary Smith’s “true” IQ score is 
somewhere between 82 and 112 (within 3 standard errors). If we are willing 
to accept the 95 per cent level of confidence we can estimate that her “true”
IQ score is somewhere between 87 and 107 (within 2 standard errors). Recog- 
nizing this amount of error in Mary Smith’s score, we would not be surprised 
if on a later intelligence test she received a score of 90, a score of 103, or any 
other score within the above ranges. Such variations can be accounted for by 
errors of measurement alone. The standard error makes us wary of attributing 
significance to minor fluctuations in test scores. 
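
The ranges given for Mary Smith can be reproduced with a short Python sketch. The function name `score_band` is our own, and the confidence levels are simply those quoted above for 1, 2, and 3 standard errors.

```python
def score_band(obtained_score, standard_error, n_standard_errors):
    """Return the (low, high) limits around an obtained score."""
    margin = n_standard_errors * standard_error
    return obtained_score - margin, obtained_score + margin

# Mary Smith's obtained IQ of 97 with a standard error of measurement of 5.
for n, confidence in [(1, "68%"), (2, "95%"), (3, "99%")]:
    low, high = score_band(97, 5, n)
    print(f"{n} standard error(s), {confidence}: {low}-{high}")
```

The same function can be applied to any obtained score once the standard error of measurement has been found in the test manual.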

The standard error of measurement makes it clear that a test score should 
be interpreted as a “band of scores” rather than as a specific score. With a 
large standard error the band of scores is large and we have less confidence 
in our obtained score. If the standard error is small the band of scores is small 
and we have greater confidence that our obtained score is a true measure of 
the characteristic. Viewing a test score as a “band of scores” makes it possible 
to interpret and use test results more intelligently. Apparent differences in test 
scores, between individuals and for the same individual over a period of time, 
frequently disappear when the standard error of measurement is considered. 
A teacher or counselor who has an acute awareness of the standard error of 
measurement finds it impossible to be dogmatic in interpreting minor differ- 
ences in test scores. 

The relationship between the reliability coefficient and the standard error of 
measurement can be seen in Table 5.3. This table presents the standard errors 
of measurement for various reliability coefficients and standard deviations.¹¹
It will be noted that as the reliability coefficient increases, for any given stand- 
ard deviation, the standard error of measurement decreases. Thus, high relia- 
bility coefficients are associated with small errors in specific test scores and low 
reliability coefficients are associated with large errors. 


¹⁰ These percentages are based on the normal curve. See Chapter 14 for description of the
normal curve.

¹¹ Standard deviation is a measure of the spread of scores. See appendix for method of
computing.


Table 5.3


STANDARD ERRORS OF MEASUREMENT FOR GIVEN VALUES OF
RELIABILITY COEFFICIENT AND STANDARD DEVIATION*


                Reliability Coefficient

SD      .95     .90     .85     .80     .75     .70

30      6.7     9.5    11.6    13.4    15.0    16.4
28      6.3     8.9    10.8    12.5    14.0    15.3
26      5.8     8.2    10.1    11.6    13.0    14.2
24      5.4     7.6     9.3    10.7    12.0    13.1
22      4.9     7.0     8.5     9.8    11.0    12.0
20      4.5     6.3     7.7     8.9    10.0    11.0
18      4.0     5.7     7.0     8.0     9.0     9.9
16      3.6     5.1     6.2     7.2     8.0     8.8
14      3.1     4.4     5.4     6.3     7.0     7.7
12      2.7     3.8     4.6     5.4     6.0     6.6
10      2.2     3.2     3.9     4.5     5.0     5.5
 8      1.8     2.5     3.1     3.6     4.0     4.4
 6      1.3     1.9     2.3     2.7     3.0     3.3
 4       .9     1.3     1.5     1.8     2.0     2.2
 2       .4      .6      .8      .9     1.0     1.1

* Standard error of measurement = $\mathrm{SD}\sqrt{1 - r}$, where SD is the
standard deviation of the test scores and r is the reliability coefficient. Reprinted from J. E.
... New York: The Psychological Corporation ...
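
The formula in the table note can be applied directly, as in this minimal Python sketch. The function name `standard_error_of_measurement` is our own; the values checked are entries taken from Table 5.3.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SD times the square root of (1 - r), as given in the note to Table 5.3."""
    return sd * sqrt(1 - reliability)

# A standard deviation of 16 and a reliability coefficient of .90.
print(round(standard_error_of_measurement(16, 0.90), 1))  # 5.1
```

Computing the value directly avoids the approximation involved in choosing the nearest row and column of the table.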


If a test manual does not report the standard error of measurement, Table
5.3 can be used to estimate the standard error. In fact this is the purpose for
which the table was developed. All one needs to do to obtain an estimate of
the standard error is to enter the table with the reliability coefficient and the
standard deviation reported in the manual. [...] the second column (.90) until you
come to the row [...]. This example is similar to data commonly reported for group
intelligence tests. The resulting standard error is approximately the same as that
used in our earlier illustration with Mary Smith and would, of course, be interpreted
in the same manner.

hi 
B are several precautions to be ke; 
ate the standard error of measurement. First, the reliability coefficient and 


s 
ey deviation must be based on the same group of persons. Second, enter- 
ok a with the reliability coefficient and standard deviation nearest to 
e 2 e manual gives you only an approximation of the standard error of 
standard ent. Third, the table does not take into account the fact that the 
error of measurement varies slightly at different score levels. Within 


pt in mind when using Table 5.3 to 


90 The Evaluation Process 


these limitations, Table 5.3 provides a simple and quick method for estimating 
the standard error of measurement and an approximation accurate enough for 
most practical applications of test results.¹²
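When a manual reports a reliability coefficient or standard deviation that falls between the tabled values, the formula on which Table 5.3 is based can be applied directly. The following is a minimal sketch of that computation, not part of the original bulletin; the function name `standard_error_of_measurement` is introduced here purely for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Return SD * sqrt(1 - r), the formula on which Table 5.3 is based."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Recompute a few rows of the table, rounded to one decimal place.
for sd in (30, 16, 10):
    row = [round(standard_error_of_measurement(sd, r), 1)
           for r in (0.95, 0.90, 0.85, 0.80, 0.75, 0.70)]
    print(sd, row)
```

The precautions stated above still apply: the reliability coefficient and the standard deviation must come from the same group, and the result is a single figure that ignores variation in the standard error at different score levels.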

The standard error of measurement has two special advantages as a means of
estimating reliability. First, the estimates are in the same units as the test
scores. This makes it possible to directly indicate the margin of error to allow
for when interpreting individual scores. Second, the standard error is likely to
remain fairly constant as you go from group to group. This is not true of the
reliability coefficient, which is highly dependent on the spread of scores in the
group tested. Since the groups on which reliabilities are reported in test manuals
will always differ somewhat from the group to be given the test, the greater
constancy of the standard error of measurement has obvious practical value. The
main difficulty encountered with the standard error occurs when we want to
compare two tests which use different types of scores. Here the reliability
coefficient is the only suitable measure.
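Because the standard error is expressed in score units, the margin of error around an individual score can be written down at once. The sketch below assumes the common convention of allowing one standard error on either side of the obtained score; the function name `score_band` and the sample values are hypothetical.

```python
def score_band(obtained_score, sem, n_sems=1):
    """Return the (low, high) limits of the band around an obtained score.

    n_sems is the number of standard errors allowed on each side;
    one standard error is assumed here as the default.
    """
    margin = n_sems * sem
    return obtained_score - margin, obtained_score + margin


# Hypothetical pupil: obtained score of 100 on a test whose SEM is 5.
low, high = score_band(100, 5)
print(f"Score of 100 is best read as somewhere between {low} and {high}")
```

A wider allowance, such as two standard errors, can be requested through `n_sems` where a more cautious interpretation is wanted.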


Factors Influencing Reliability 


A number of factors have been shown to affect reliability. If sound conclu- 
sions are to be drawn, these factors must be considered when interpreting 
reliability coefficients. We have already seen, for example, that speeded tests
will provide a spuriously high reliability coefficient with the internal consistency 
methods of estimating reliability. We have also noted that test-retest reliability 
coefficients are influenced by the time interval between testings, with shorter 
time intervals resulting in higher reliability coefficients. Thus, in comparing 
the reliability coefficients of two or more tests we must take such factors into
account. Although we might want to favor the test with the highest reliability 
coefficient, we would not do so if we recognized that the reported coefficient was 
inflated by factors irrelevant to the consistency of the measurement procedure. 
Similarly, we might discount the difference between reliability coefficients 
reported for two different tests if the conditions under which they were obtained 
favored the test with the highest reliability coefficient. 

Consideration of the factors influencing reliability will not only help us inter- 
pret the reliability coefficients of standardized tests more wisely, but should also 
aid us in constructing more reliable classroom tests. Though teachers seldom 
find it profitable to calculate reliability coefficients for the tests they construct, 
they can and should take cognizance of the factors influencing reliability to 
maximize the reliability of their own classroom tests. 

¹² J. E. Doppelt, How Accurate Is a Test Score? Test Service Bulletin No. 50 (New York:
The Psychological Corporation, 1956).

Length of Test. In general, the longer the test the higher the reliability.
This is due to the fact that a longer test will provide a more adequate sample
of the behavior being measured and the scores are apt to be less influenced by
chance factors such as guessing. Suppose, to measure spelling ability, we asked
pupils to spell one word. The results would be patently unreliable. Pupils who
were able to spell the word would be perfect spellers and pupils who could not
would be complete failures. If we happened to select a difficult word most pupils
would fail, and if the word was an easy one most pupils would appear to be perfect
spellers. The fact that one word provides an unreliable estimate of a pupil's
spelling ability is obvious. It should be equally apparent that as we add spelling
words to the list, we come closer and closer to a good estimate of each child's
spelling ability. Scores based on a large number of spelling words are more
apt to reflect real differences in spelling ability and therefore to be more stable.
Thus, by increasing the size of the sample of spelling behavior we increase the
consistency of our measurement.

A longer test also tends to lessen the influence of chance factors such as
guessing. For example, on a ten-item true-and-false test a pupil might know
seven of the items and guess at the other three. He could guess right on all three
and have a perfect score or he could guess wrong on all three items and
end up with only seven correct. This would represent considerable variation in
the test score due to guessing alone. However, if this same pupil were taking
a test with a hundred true and false items his correct guesses would tend to be
cancelled by his incorrect guesses and the score would be a more dependable
indication of his actual knowledge.

The fact that a longer test tends to provide more reliable results was implied
earlier in our discussion of the split-half method. It will be recalled that when
scores on two halves of a test correlated .60 the Spearman-Brown formula
estimated the reliability of the scores for the full-length test to be .75. This, of
course, is equivalent to estimating the increase in reliability to be expected when
the length of the test is doubled.

There is one important reservation in evaluating the influence of test length
on the reliability of the scores. That is, the statements we have been making
assume that the test will be lengthened by adding test items of the same quality
as those already in the test. Adding ten spelling words that are so easy that
everyone will get them correct or adding ten spelling words that are so difficult
that no one will get them correct will not increase the reliability of the scores
on a spelling test. In fact there would be no influence on reliability since such
additions would not influence the relative standing of the pupils in the group.

In constructing classroom tests it is important to keep in mind the influence
of test length on reliability and strive for longer tests. Where short tests are
necessary because of time limits or the age of the pupils, more frequent testing
may be used to obtain a dependable measure of achievement.

In using standardized tests, we should be wary of part scores based on relatively
few items. Such scores are usually low in reliability and of little or no
practical value. Before using such scores the test manual should be carefully
checked for their reported reliabilities. If these are not reported, or are very
low, the part scores should be ignored and only the total test score should be
used.
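The effect of test length described in this section can be sketched with the general form of the Spearman-Brown formula, which estimates the reliability of a test lengthened by any factor, not only doubled. The function name `spearman_brown` and the choice of length factors are illustrative assumptions; the formula itself presumes, as noted above, that the added items are of the same quality as those already in the test.

```python
def spearman_brown(reliability, length_factor):
    """Estimate reliability after multiplying test length by length_factor.

    Assumes the added items are comparable in quality to the existing ones.
    """
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


# Starting from a reliability of .60, lengthen the test 1, 2, 3, and 4 times.
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(0.60, k), 2))
```

With a length factor of 2 and a starting value of .60, the function returns the .75 cited in the text. Each further increase in length adds less than the one before, which is one reason part scores based on few items gain so much from added items while long tests gain little.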

Spread of Scores. As noted earlier, reliability coefficients are directly
influenced by the spread of scores in the group tested. Other things being equal, the




larger the spread of scores, the higher the estimate of reliability. Since higher 
reliability coefficients result when individuals tend to stay in the same relative 
position in a group, from one testing to another, it naturally follows that any- 
thing which reduces the possibility of shifting positions in the group also con- 
tributes to higher reliability coefficients. In this case larger differences between 
the scores of individuals reduce the possibility of shifting positions. Stated 
another way, errors of measurement have less influence on the relative position 
of individuals where the differences among group members are large—that is, 
where there is a wide spread of scores. 

This can be easily illustrated without recourse to statistics. Compare the
following two sets of scores in terms of the probability that the individuals will
remain in the same relative position on a second administration of the test.


Group A     Group B
   95          95
   90          94
   86          93
   82          93
   76          92
   65          91
   60          89
   56          88
   53          86
   47          85


Even a cursory inspection of these scores will make it clear that the persons
in Group B are more likely to shift positions on a second administration of the
test. With only a spread of ten points from the top score to the bottom score,
radical shifts in position can result from changes of just a few points in the
test scores of these individuals.


However, in Group A the test scores of individuals could vary by several 
points, on a second administration of the test, with very little shifting in the 
relative position of the group members. The large spread of test scores in 
Group A makes shifts in relative position unlikely, and thus gives us greater 
confidence that these differences among group members are real differences. 
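For readers who do wish to check the point numerically, the comparison of Group A and Group B can be sketched as a small simulation. Several assumptions are made here that are not in the text: the listed scores are treated as error-free, each administration adds an independent random error with a standard deviation of 3 points, and the correlation between two simulated administrations is averaged over many trials. All names are hypothetical.

```python
import math
import random

GROUP_A = [95, 90, 86, 82, 76, 65, 60, 56, 53, 47]
GROUP_B = [95, 94, 93, 93, 92, 91, 89, 88, 86, 85]


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_retest_correlation(scores, error_sd=3.0, trials=2000, seed=0):
    """Average correlation between two simulated administrations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        first = [s + rng.gauss(0, error_sd) for s in scores]
        second = [s + rng.gauss(0, error_sd) for s in scores]
        total += pearson(first, second)
    return total / trials


print("Group A:", round(mean_retest_correlation(GROUP_A), 2))
print("Group B:", round(mean_retest_correlation(GROUP_B), 2))
```

The same amount of error is applied to both groups, so any difference between the two averages is due to the spread of scores alone.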

With the exception of mastery tests, teachers should attempt to construct class- 
room tests which result in a wide spread of scores. In this way the teacher can 
have greater assurance that the differences in achievement reflected in the test 
scores are reliable differences in pupil achievement and not differences due to 
chance factors such as guessing. To obtain a wider spread of test scores, most 
teachers need to construct more difficult tests. The traditional belief that a pupil 
should get 75 per cent of the test items correct (the “passing score”) has caused 
teachers to include too many simple questions in their tests. This assures that 
most pupils will pass but it also restricts the possible range of test scores and 
consequently contributes to the unreliability of the test. As the older concept 
of the “passing score” is discarded, it will be possible to broaden the spread 
of test scores and consequently to improve the reliability of classroom tests. 




In selecting standardized tests the influence of the spread of test scores on 
reliability coefficients should also be considered. For example, many test pub- 
lishers report reliability coefficients calculated on test scores over several grade 
levels. Since the combined scores of pupils from several grade levels have a 
much larger spread of scores than that found at a single grade level, such 
reliability coefficients are spuriously high. These reliability coefficients should 
be disregarded when selecting a test for a particular grade level. Every effort 
should be made to obtain reliability evidence on a group of pupils similar to 
the one to whom we plan to administer the test. Only in this way can we have 
some assurance that the reliability coefficients reported in the test manual 
provide a satisfactory estimate of the test's reliability for our particular group 
of pupils.
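The inflation produced by combining grade levels can be sketched by rearranging the formula for the standard error of measurement, $SEM = SD\sqrt{1 - r}$, into $r = 1 - (SEM/SD)^2$. The sketch assumes that the standard error stays the same in both groups, and the figures used (a standard error of 4, a standard deviation of 8 within one grade and 16 across several grades) are hypothetical.

```python
def implied_reliability(sem, sd):
    """Reliability coefficient implied by a given SEM and standard deviation."""
    return 1.0 - (sem / sd) ** 2


SEM = 4.0                 # assumed constant from group to group
SD_SINGLE_GRADE = 8.0     # hypothetical spread within one grade
SD_SEVERAL_GRADES = 16.0  # hypothetical spread across combined grades

print("single grade:  ", implied_reliability(SEM, SD_SINGLE_GRADE))
print("several grades:", implied_reliability(SEM, SD_SEVERAL_GRADES))
```

The accuracy of the individual scores is identical in the two cases; only the spread of the group, and therefore the coefficient, has changed.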

Difficulty of Test. Tests which are too easy or too difficult for the group
members taking them will tend to provide scores of low reliability. This is due to
the fact that both easy and difficult tests result in a restricted spread of scores.
In the case of the easy test, the scores are close together at the top end of the
scale. With the difficult test, the scores are grouped together at the bottom end
of the scale. For both, however, the differences among individuals are small and
tend to be unreliable.

The implications for classroom testing are obvious and were touched upon
in the previous section. Classroom achievement tests should be so constructed
that the average score is 50 per cent correct and that the scores range from near
zero to near perfect. This will assure the maximum spread of scores and increase
the probability that individual differences in achievement are reliably measured.
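One way to see why a 50 per cent difficulty level favors a wide spread of scores is to look at a single item scored right or wrong. If a proportion p of the pupils answer it correctly, the variance of the item scores is p(1 - p), which is largest when p is .50. The sketch below illustrates only this one-item result; it says nothing about how the items are related to one another, and the function name is hypothetical.

```python
def item_variance(p):
    """Variance of a right/wrong item answered correctly by proportion p."""
    return p * (1.0 - p)


# Very easy and very hard items spread pupils out the least.
for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(f"proportion correct {p:.2f}   variance {item_variance(p):.4f}")
```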

The difficulty of test items in standardized tests should also be carefully
evaluated. Where a test has been designed for several grade levels the difficulty level
is usually most appropriate for the grades in the middle of the range. For
example, the Gates Reading Survey for grades 3 to 10 is most effective for grades 5
to 8. In the earlier grades the test is too difficult for the pupils, and in grades
9 and 10 the test tends to be too easy.¹³ At these extreme grade levels, then, one
can expect the differences between individuals to be less reliable. Information
concerning the difficulty of the test can be obtained from the test manual as
well as from an inspection of the test items themselves.

In evaluating the difficulty of a test the teacher must also take into account
the level of ability of his pupils. A test which is of appropriate difficulty for
average fifth graders may be inappropriate for a fifth-grade class which
contains a disproportionate number of slow learners or of gifted pupils. The
reliability of the test scores would, of course, be lower where the difficulty of the
test was inappropriate for the group members taking it.

Objectivity. The objectivity of a test refers to the degree to which equally
competent scorers obtain the same results. Most standardized tests of aptitude
and achievement are high in objectivity. The test items are of the objective type
(e.g., multiple choice) and the resulting scores are not influenced by the
judgment or opinion of the scorers. In fact, such tests are usually constructed so
that they can be accurately scored by trained clerks and scoring machines.
Where such highly objective procedures are used the reliability of the test results
is not affected by the scoring procedures.

¹³ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).

In the case of classroom tests constructed by teachers, however, objectivity
may play an important role in obtaining reliable measures of achievement. In
essay testing, as well as in the use of various observational procedures, the
results depend to a large extent upon the person doing the scoring. Different
persons get different results and even the same person may get different results
at different times. Such inconsistency in scoring has an adverse effect on the
reliability of the measures obtained, for the test scores now reflect the opinions
and biases of the scorer as well as the differences among pupils in the
characteristic being measured.

The solution is not to use only objective tests and abandon all subjective 
methods of evaluation. This would have an adverse effect on validity and, as
we noted earlier, validity is the most important quality of evaluation results. 
A more desirable solution is to select the evaluation procedure most appropriate 
for the behavior being evaluated and then to make the evaluation procedure 
as objective as possible. In the use of essay tests, for example, objectivity can 
be increased by careful phrasing of the questions and by using a standard set 
of rules for scoring. Such increased objectivity will contribute to greater re- 
liability without sacrificing validity. 

Methods of Estimating. When comparing the reliability coefficients of two
or more standardized tests it is important to consider the methods that were
used to obtain the reliability estimates. Other things being equal, the size of the
reliability coefficient is related to the method of estimating reliability in the
following manner:


(1) Split-half method
        Highest reliability coefficients reported for a given test.
        Estimate inflated by factors such as speed.

(2) Test-retest method
        Medium to high reliability coefficients reported for a given test.
        May be higher than split-half method if time interval is short.
        Become lower as time interval between tests is increased.

(3) Equivalent-forms method (without time interval)
        Medium to high reliability coefficients reported for a given test.

(4) Equivalent-forms method (with time interval)
        Lowest reliability coefficients reported for a given test.
        Become lower as time interval between forms is increased.


This variation in the size of the reliability coefficient due to the method of
estimating reliability is directly attributable to the type of consistency included
in each method. It will be recalled that the equivalent-forms method with an
intervening time interval took into account all possible sources of variation in
the test scores and consequently is the most rigorous method of estimating
reliability. Since lower reliability coefficients can be expected with this method,
it is grossly unfair to make a direct comparison of such reliability coefficients
with those obtained by less stringent methods.
At the other extreme, the higher reliability coefficients reported for the
split-half method should be accepted cautiously and should not be compared directly
with reliability coefficients obtained by other methods. If speed is an important
factor in the testing, split-half reliability coefficients should be disregarded
entirely and other evidence of reliability should be sought.
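To make concrete what a split-half coefficient includes, the procedure can be sketched on a small set of item scores. The data below are hypothetical (six pupils, eight right/wrong items), and dividing the test into odd-numbered and even-numbered items is one common way of forming the halves, assumed here for illustration. The correlation between the halves is then stepped up with the Spearman-Brown formula for a test of double length.

```python
import math

# Hypothetical item scores: one row per pupil, 1 = right, 0 = wrong.
ITEM_SCORES = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then step up to full length."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, 7
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, 8
    r_half = pearson(odd_half, even_half)
    return 2.0 * r_half / (1.0 + r_half)  # Spearman-Brown, length doubled


print(round(split_half_reliability(ITEM_SCORES), 2))
```

Both halves come from a single sitting, so day-to-day changes in the pupils never enter the estimate. This is why the coefficient runs higher than those from methods that include a time interval.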


How High Should Reliability Be? 


The degree of reliability we shall demand in our educational measures
depends largely on the nature of the decision to be made. If we are going to use
test results as a basis for deciding whether to review certain areas of subject
matter, we might be willing to use a teacher-made test of unknown reliability.
Our decision will be based on the scores of the total group, and inconsistency
in individual scores will not distort our decision too much. Even if we do err,
no major catastrophe will result. The worst that can happen is that the pupils
will get an unnecessary review of material or they will be deprived of a review
that might be beneficial to them. On the other hand, if we are going to use test
results as a basis for deciding which pupils should be placed in special classes
for the mentally handicapped we shall demand the most reliable measures
available. We would not be satisfied with group tests of intelligence for this
purpose but would want to use one of the more reliable individual measures of
intelligence. We probably would also want to obtain the most reliable evidence
available concerning the pupil's learning, social development, and adjustment
before a final decision is made. This decision is so important and the
consequences so significant that we are willing to devote considerable time and
expense to increase the reliability of our data even if the increase is slight. We
want to be as confident as possible that we are making the correct decision when
we place a pupil in a special class for the mentally handicapped.

It is not only the importance of a decision that matters, but also whether it
is possible to confirm or reverse the judgment at a later time.¹⁴ Decision making
in education is seldom a single, final act. It tends to be sequential in nature,
starting with rather crude judgments and proceeding through a series of more
refined judgments. In the early stages of decision making low reliability might
be quite tolerable, because test results are used primarily as a guide to further
information gathering. For example, on the basis of classroom tests of
questionable reliability we might decide that some of our pupils are having learning
difficulties of such a serious nature that they are in need of special help. This
decision provides a useful hunch that can be confirmed or refuted by further
testing with more dependable measures. Similarly, a personality inventory of
low reliability may be useful as a first step in detecting maladjusted pupils,
providing those with scores indicating possible maladjustment are followed up
by more intensive study. Also, group scholastic aptitude scores of only moderate
stability may be useful in grouping elementary pupils, since those who are
misclassified can be easily shifted as new evidence becomes available.
Opportunities for confirmation and reversal of judgments without serious consequences
are almost always present in the early stages of educational decision making.

¹⁴ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).

The important thing when reliability is low is not to treat the scores as if 
they were highly accurate. Make tentative judgments, seek confirming data, and 
be willing to reverse decisions when wrong. Some modification in school policy 
may also be required. If, for example, mental ability proves to be unstable until 
age sixteen, one should not adopt a classification policy which makes the de- 
cision about who shall plan to go to college at age eleven. In summary, test 
scores of low reliability can be useful if they are interpreted with caution and 
used only for tentative reversible decisions.

Where final irreversible decisions are being made, we shall be, of course,
compelled to seek the most reliable information available. We would not want
to award scholarships, reject college applicants, or commit a person to a mental
institution on the basis of measures with low or questionable reliability.

Thus, when we ask the question "How high should reliability be?" several
considerations must be taken into account. How important is the decision? Is it 
one that can be confirmed or reversed at a later time? How far reaching are the 
consequences of the action taken? For important decisions which are irreversi- 
ble and apt to have great influence on the lives of individual pupils, we shall 
make stringent demands on the reliability of the measures we use. For lesser 
decisions, and especially for those that can be later confirmed or reversed 
without serious consequences, we shall be willing to settle for less reliable meas- 
ures. Thus, it depends largely on how confident we need to be about the decision 
being made. Greater confidence requires higher reliability.¹⁵


USABILITY 


In selecting evaluation instruments, practical considerations cannot be
neglected. Tests are usually administered and interpreted by teachers with only a
minimum amount of training in measurement. The time available for testing is
almost always limited and is in constant competition with other important
activities for its allotted time in the school schedule. Likewise, the cost of testing,
though a minor consideration, is as carefully scrutinized by budget-conscious
administrators as are other expenditures of school funds. These and other factors 


program. 


¹⁵ A. G. Wesman, Reliability and Confidence. Test Service Bulletin No. 44 (New York:
The Psychological Corporation, 1952).




Ease of Administration 


Where tests are to be administered by teachers or others with limited
training, ease of administration is an especially important quality to seek in a test.
For this purpose the directions should be simple and clear, the subtests should
be relatively few, and the timing of the test should not be too difficult.
Administering a test with complicated directions and a number of subtests lasting
but a few minutes each is a taxing chore for even an experienced examiner.
For a person with little training and experience, such a situation is fraught with
possibilities for errors in giving directions, timing, and other aspects of the
administration which are likely to affect the results. Such errors of
administration have, of course, an adverse effect on the validity and reliability of the
resulting test scores.


Time Required for Administration


With time for testing at a premium, we shall always favor the shorter test,
other things being equal. In this case other things are seldom equal, however,
since reliability is directly related to the length of the test. If we attempt to cut
down too much on the time allotted to testing we are apt to reduce drastically
the reliability of our scores. For example, tests designed to fit a normal class
period usually provide total test scores of satisfactory reliability, but their part
scores, obtained from the subtests, tend to be unreliable. If we want reliable
measures in the areas covered by the subtests, we need to increase our testing
time in each area. On the other hand, if we want a general measure in some
area such as verbal intelligence, we can obtain reliable results in 30 or 40
minutes and there is little advantage in extending the testing time. A safe
procedure is to allot as much time as is necessary to obtain valid and reliable
results and no more. Somewhere between 20 and 60 minutes of testing time
for each individual score yielded by a standardized test is probably a fairly
good guide.


Ease of Scoring


Traditionally, one of the most tedious and troublesome aspects of a school
testing program has been the scoring of the tests. In the past, many an
overworked teacher has spent hours upon hours at this task. To make the procedure
even more burdensome than it needed to be, scoring directions were frequently
complicated, the tests contained numerous subtests and some subjective test
items, and the scoring keys were cumbersome. Although the scoring of tests is
still a problem to be reckoned with, recent developments in testing have eased
the burden considerably. These developments include (1) the trend toward
completely objective standardized tests, (2) improved clarity in the directions for
scoring and increased simplicity in the scoring key, (3) the use of separate
answer sheets, and (4) machine scoring.

In selecting standardized tests, those which require a minimum amount of
time, skill, and expense for the scoring should be given preference. The use of




separate answer sheets, for example, will not only contribute to ease of scoring 
but will also reduce the cost of testing due to the fact that the same test booklets 
can be used over again a number of times. In addition, if machine scoring is
available at a reasonable cost, separate answer sheets could relieve teachers of
an irksome clerical task. Such factors should be taken into account at the time
the test is being evaluated, and no test should be selected until the procedures
for scoring have been given careful thought. Other things being equal, we shall
favor the test which provides for ease and economy of scoring without sacrificing 
scoring accuracy. 


Ease of Interpretation and Application 


In the final analysis, the success or failure of a testing program is
determined by the use made of the test results. If they are interpreted correctly and
applied effectively they will contribute to more intelligent educational decisions.
On the other hand, if the test results are misinterpreted or misapplied or not
applied at all they will be of little value and may actually be harmful to some
individual or group.

Information concerning the interpretation and use of test results is usually
obtained directly from the test manual or related guides. Attention should be
directed toward the ease with which the raw scores can be converted into
meaningful derived scores, the clarity with which the tables of norms are presented,
and the comprehensiveness of the suggestions for applying the results to
educational problems. Where the test results are to be presented to the pupils, or
their parents, ease of interpretation and application should be given special
consideration.


Availability of Equivalent or Comparable Forms 


For many educational purposes equivalent forms of the same test are often 
desirable. Equivalent forms of a test measure the same aspect of behavior by 
using test items which are alike in content, level of difficulty, and other signifi- 
cant characteristics. Thus, one form of the test can substitute for the other.
This makes it possible to test pupils twice in rather close succession without
their answers on the first testing influencing their performance on the second
testing. The advantage of equivalent forms is readily seen in studies of
achievement gain. Here we want to eliminate the factor of memory while testing the
pupils twice in the same area of achievement. Equivalent forms of a test may
also be used to verify a questionable test score. For example, a teacher may
feel that a scholastic aptitude or achievement test score is too low for a given
pupil. This may be easily checked by administering an equivalent form of the
test.

Many tests also provide comparable forms. Achievement tests, for example, are
commonly arranged in a series which cover different grade levels. Although the
content and level of difficulty vary, the tests at the different levels are made
comparable through a common score scale. Thus, it is possible to compare
measurements in grade four with measurements in grade six on a more advanced
form of the test. Comparable forms are especially useful in long-range studies of
educational growth.


Cost


The factor of cost has been left to the last because it is relatively unimportant
in selecting tests. The reason for discussing it at all is that it is sometimes given
more weight than it deserves. Testing is relatively inexpensive and cost
should not be a major consideration. In large-scale testing programs where
small savings per pupil add up, using separate answer sheets, machine scoring,
and reusable booklets will reduce the cost appreciably. To select one test instead
of another, however, because the test booklets are a few cents cheaper is false
economy. After all, validity and reliability are the important characteristics to
look for, and a test lacking in these qualities is too expensive at any price. On
the other hand, the contribution that valid and reliable test scores can make
to educational decisions seems to indicate that such tests are always economical
in the long run.


SUMMARY


Next to validity, reliability is the most important quality to seek in
evaluation results. Reliability refers to how consistent test scores and other evaluation
results are from one measurement to another. In interpreting and using
reliability information it is important to remember that reliability refers to
the results of measurement, that different ways of estimating reliability indicate
different types of consistency, that a reliable measure is not necessarily valid,
and that reliability is strictly a statistical concept. Reliability estimates may be
reported in terms of a reliability coefficient or the standard error of
measurement.

Reliability coefficients are determined by several different methods and each
coefficient is given a name which specifies the type of consistency being
investigated. The test-retest method involves giving the same test twice to the same
group with an intervening time interval, and the resulting coefficient is called a
coefficient of stability. How long the time interval should be between tests is
determined largely by the use to be made of the results. We shall be primarily
interested in stability coefficients based on intervals comparable to the periods
of time covered in our predictions. The equivalent-forms method involves giving
two forms of a test to the same group in close succession or with an intervening
time interval. The first results in a coefficient of equivalence, and the second in a
coefficient of stability and equivalence. The latter coefficient provides the
most rigorous test of reliability, since it includes all possible sources of variation
in the test score. Reliability can also be estimated from a single administration
of a single form of a test, either by correlating the scores on two halves
of the test or by applying one of the Kuder-Richardson formulas. Both methods
provide a coefficient of internal consistency and are easy to apply. However, they
are not applicable to speeded tests and they provide no information concerning
the stability of test scores from day to day.




The standard error of measurement indicates reliability in terms of the
amount of variation to be expected in individual test scores. It can be computed 
from the reliability coefficient and the standard deviation, but it is frequently 
reported directly in test manuals. The standard error is especially useful in inter- 
preting test scores, since it indicates the “band of error” surrounding each score. 
It also has the advantage of remaining fairly constant from group to group. 
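As a rough sketch of the computation referred to above, the standard error of measurement can be obtained as the standard deviation multiplied by the square root of one minus the reliability coefficient. The function names and the figures used below are hypothetical, and the "band of error" is shown simply as the score plus and minus a chosen number of standard errors.

```python
import math


def standard_error_of_measurement(std_dev, reliability):
    """SEM = SD * sqrt(1 - r), where r is the reliability coefficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return std_dev * math.sqrt(1.0 - reliability)


def score_band(score, sem, n_sem=1):
    """Band of error around an obtained score, n_sem standard errors wide
    on each side (n_sem=1 is an assumed default, not a fixed rule)."""
    return (score - n_sem * sem, score + n_sem * sem)


# Hypothetical test: standard deviation of 10 and reliability of .91.
sem = standard_error_of_measurement(10, 0.91)
low, high = score_band(75, sem)
print(round(sem, 2), round(low, 2), round(high, 2))
```

For these assumed figures the standard error is about 3 score points, so an obtained score of 75 is better read as "somewhere between about 72 and 78" than as a single exact point.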

Reliability estimates may vary in accordance with the length of the test, the 
spread of scores in the group tested, the difficulty of the test, the objectivity of 
the scoring, and the method of estimating reliability. These factors should be 
taken into account when appraising reliability information. The degree and 
type of reliability to be sought in a particular instance depends primarily on 
the decision being made. For tentative reversible decisions low reliability may 
be tolerable. However, for final irreversible decisions we shall make stringent 
demands on the reliability of our measures. 

In addition to validity and reliability, it is also important to consider the
usability of tests and other evaluation instruments. This includes such prac-
tical features as ease of administration, time required, ease of scoring, ease of
interpretation and application, availability of equivalent or comparable forms,
and cost.


SUGGESTIONS FOR FURTHER READING 


Anastasi, Anne. Psychological Testing. 2nd edition. New York: Macmillan, 1961. Chapter 5:
“Test Reliability.”

Bauernfeind, R. H. Building a School Testing Program. Boston: Houghton Mifflin, 1963.
Chapter 6: “The Concept of Reliability.”

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper & Row, 1960.
Chapter 6: “How to Choose Tests.”

Hoyt, C. J. “Reliability,” Encyclopedia of Educational Research. 3rd edition. New York:
Macmillan, 1960. Pages 1144-1147.

Saupe, J. L. “Some Useful Estimates of the Kuder-Richardson Formula Number 20
Reliability Coefficient,” Educational and Psychological Measurement, 21, 63-71, 1961.

Wesman, A. G. “Some Effects of Speed in Test Use,” Educational and Psychological
Measurement, 20, 267-274, 1960.


Test Bulletins*


Diederich, P. Short-cut Statistics for Teacher-Made Tests. Evaluation and Service Series,
No. 5. Princeton, N.J.: Educational Testing Service, 1960. Presents simplified methods of
estimating the standard error and the reliability coefficient.

Wesman, A. G. Comparability vs. Equivalence of Test Scores. Test Service Bulletin, No. 53.
New York: The Psychological Corporation, 1958.


* Also see Footnotes 12 and 15 in this chapter. 


Chapter 6

principles and procedures of classroom testing


Classroom tests play a central role in the evaluation of pupil progress.
They provide direct measures of many important learning outcomes
and indirect evidence concerning others. . . . The validity of the informa-
tion they provide, however, depends on the principles underlying their
construction and use.


The measurement of pupil achievement requires the extensive use of tests
constructed by classroom teachers. This is so because many of our instructional
outcomes can be measured by paper-and-pencil tests and because standardized
tests are seldom well adapted to the particular objectives we emphasize in our
teaching. In addition, we use tests for such a variety of instructional purposes
and at such varying segments of instruction that standardized tests cannot
possibly be used. For example, we may want to measure achievement at the end
of a unit of work, diagnose a learning difficulty which has just come to our
attention, or check on how well the pupils have mastered a specific skill. For
these, and other instructional purposes, we will usually find it necessary to
construct tests which are uniquely adapted to our particular situation.


PRELIMINARY CONSIDERATIONS 


As noted in earlier chapters, the actual construction of test questions should
be preceded by a series of preliminary steps. First, the objectives and specific
learning outcomes must be identified and defined in terms of desired changes
in pupil behavior. Second, the subject-matter content must be outlined. Third,
a table of specifications, which relates the objectives to the subject-matter con-
tent, should be developed. Finally, specific test questions are constructed in
accordance with the table of specifications.


IMPORTANCE OF THE TABLE OF SPECIFICATIONS 


The only assurance we have that a classroom test is a valid measure of the
objectives and course content we are interested in testing is to use some sys-
tematic procedure for obtaining a representative sample of pupil behavior in
each of the areas. The table of specifications is a device which provides a
systematic procedure. By listing the objectives across the top of the table and
the subject-matter topics down the side of the table, we can weight each cell
of the table in accordance with the importance we attach to each objective and
each subject-matter topic.1 These weights indicate the number of test items to
be devoted to each objective and each subject-matter topic. By constructing test
items which fit these specifications, it is possible to build a test which measures
both the objectives and the subject-matter in a representative manner. In other
words, the emphasis in our test will reflect the emphasis in the table of specifi-
cations which, of course, should reflect the emphasis in our teaching.
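The weighting step can be pictured with a small sketch. It is only one hypothetical way of turning cell weights into item counts: each cell receives a share of the test proportional to its weight, and any items left over after rounding down go to the cells with the largest fractional remainders. The topics, objectives, weights, and function name below are invented for illustration.

```python
def allocate_items(weights, total_items):
    """Distribute total_items over the cells of a table of specifications.

    weights maps each subject-matter topic (a row) to a dict of
    objective -> relative weight (the columns). Returns a dict keyed by
    (topic, objective) giving the number of test items for that cell.
    """
    total_weight = sum(w for row in weights.values() for w in row.values())
    exact = {
        (topic, objective): total_items * w / total_weight
        for topic, row in weights.items()
        for objective, w in row.items()
    }
    # Start from the whole-number part of each cell's share.
    counts = {cell: int(share) for cell, share in exact.items()}
    # Hand out any remaining items to the largest fractional remainders.
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


# Hypothetical weights for a short science unit and a 20-item test.
weights = {
    "Weather maps": {"Knows terms": 2, "Interprets data": 3},
    "Cloud types": {"Knows terms": 3, "Interprets data": 2},
}
for cell, n in allocate_items(weights, 20).items():
    print(cell, n)
```

Whatever rounding rule is used, the counts should add up to the planned length of the test and should mirror the relative emphasis given in teaching.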

Unless a table of specifications, or some similar device, is used as a guide in
test construction, there is a tendency to overload the test with items measuring
knowledge of isolated facts and to neglect the more complex learning outcomes.
In the social studies area, for example, it is not uncommon to include a dispro-
portionately large number of items which measure knowledge of names, dates,
places, and the like. In science, the defining of terms and the naming of struc-
tures and functions is commonly overemphasized. In mathematics, computational
skill is frequently the only learning outcome measured. In language arts and
literature, the identification of parts of speech, literary characters, authors, and
the like, are frequently all too prominent. These learning outcomes are, gen-
erally, not stressed because the teacher thinks knowledge of isolated facts is
more important than understandings, applications, interpretations, and various
thinking skills. Rather, they usually receive undue prominence because the
teacher finds it easier to construct such test items. Without a carefully developed
test plan, ease of construction all too frequently becomes the dominant criterion
in selecting and constructing test items. As a consequence, the test measures a
limited and biased sample of pupil behavior and neglects many of the learning
outcomes considered most important by the teacher. In short, without a carefully
developed test plan, the test tends to lack content validity.


TYPES OF CLASSROOM TESTS 


Tests constructed by teachers are usually divided into two general types:
(1) the objective test which is highly structured and requires the pupil to
supply a word or two, or to select the correct answer from among a limited
number of alternatives, and (2) the essay test which permits the pupil to select,
organize, and present his answer in essay form. There is no conflict between
these two types of tests. For some instructional purposes the objective test may
be most efficient while for others the essay test may prove most satisfactory.
Each should be used where most appropriate, with appropriateness determined
by the learning outcomes to be measured and by the unique advantages and
limitations of each type.


1 See Chapter 3 for illustrative tables of specifications.


The Objective Test 


The objective test includes a variety of item types. Objective test items can
be classified into those which require the pupil to supply the answer and those
which require him to select the answer from a given number of alternatives.2
These may be further subdivided into the following basic types of test items:


Supply types:

1. Short answer.

EXAMPLES

What is the name of the author of Moby-Dick? (Herman Melville)
What is the formula for hydrochloric acid? (HCl)
What is the value of X in the equation 2X + 5 = 9? (2)

2. Completion.

EXAMPLES

Lines on a weather map joining points with the same barometric pressure are called
(isobars).
The formula for ordinary table salt is (NaCl).
In the equation 2X + 5 = 9, X = (2).


Selection types:

1. True-false or alternative response.

EXAMPLES

(T) F     A virus is the smallest known organism.
T (F)     An atom is the smallest particle of matter.
Yes (No)  In the equation 2X + 5 = 9, X equals 3.
(Yes) No  Acid turns litmus paper red.

2. Matching.

EXAMPLES

(C) 1. And          A Adjective
(D) 2. Dog          B Adverb
(G) 3. Jump         C Conjunction
(F) 4. She          D Noun
(B) 5. Quickly      E Preposition
                    F Pronoun
                    G Verb

3. Multiple Choice.

EXAMPLES

Why is the inhaling of carbon monoxide harmful to man?
A It causes increased blood pressure.
B It damages lung tissue.
(C) It destroys red blood cells.
D It prevents oxygen from entering the lungs.

In the equation 2X + 5 = 9, the 2X means
A 2 plus X.
B 2 minus X.
C 2 divided by X.
(D) 2 multiplied by X.

Which of the following sentences has a disagreement between subject and verb?
A When they win, they are happy.
(B) Politics are hard to understand.
C The majority is always right.
D One or the other is to be elected.


2 R. L. Ebel, “Writing the Test Item,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).


In addition to these basic types of objective test items, there are numerous 
modifications and combinations of types. There is little to be gained from a 
listing of all the possible variations, since many are unique to particular objec- 
tives or to specific subject-matter areas. Some of the more common variations 
used to measure understandings, thinking skills, and other complex learning 
outcomes will be illustrated later. These, plus an understanding of the general 
principles of test construction and of the principles that apply to each of the 
specific types of objective test items, should enable the teacher to make adaptations
which best fit his particular purposes.

The various types of objective test items have one feature in common which 
distinguishes them from the essay test. They present the pupil with a highly 
structured task which limits the type of response he can make. To obtain the 
correct answer, the pupil must demonstrate the specific knowledge, understand- 
ing, or skill called for in the item. He is not free to redefine the problem or to 
organize and present the answer in his own words. He must select one of several 
alternative answers or supply the correct word, number, or symbol. This struc- 
turing of the problem and restriction on the method of responding contributes to 
objective scoring which is quick, easy, and accurate. On the negative side, this 
same structuring makes the objective test item inappropriate for measuring 
the ability to select, organize, and integrate ideas. To measure such outcomes 
we must depend on the essay test. 


The Essay Test 


The essay test is commonly viewed as a single item type. A useful classifica- 
tion, however, is one based on the amount of freedom of response allowed the 
pupil. This includes the extended response type where the pupil is given almost
complete freedom in making his response and the restricted response type where
limitations are placed on the nature, length, or organization of his response.3


These types are illustrated below:

1. Extended Response Type.

Describe what you think should be the role of the Federal Government in main-
taining a stable economy in the United States. Include specific policies and programs
and give reasons for your proposals.

2. Restricted Response Type.

State two advantages and two disadvantages of maintaining high tariffs on goods
from other countries.


It will be noted, in the above examples, that the extended response type ques-
tion permits the pupil to decide which facts he thinks are most pertinent, to
select his own method of organization, and to write as much as he deems neces-
sary to provide a comprehensive answer. Thus, such questions tend to reveal
[...] them in a coherent manner, and to express [...] reflect individual differences in
[...]

[...] the extended response type of question, it has
severe limitations on its use.4 It is inefficient for measuring knowledge of factual
material, since the questions are so extensive that only a small sample of the
course content can be covered in any one test. The scoring is difficult and apt to
be unreliable because the answers include an array of factual information of
varying degrees of correctness, organized with varying degrees of coherence, and
expressed with varying degrees of legibility and conciseness.

The restricted response type of question minimizes some of the weaknesses of
the extended response type. Restricting the type of response called for makes
it more efficient for measuring knowledge of factual material and reduces the
difficulty of the scoring somewhat. On the other hand, the more highly structured
task presented by the restricted response type question makes it less effective
as a measure of the ability to select, organize and integrate ideas, which is one
of the major purposes to be served by the essay test.

As with the various forms of objective test items,5 neither the extended re-
sponse type question nor the restricted response type question can serve all
purposes equally well. The type to use in a particular situation depends mainly
on the learning outcomes to be measured and to a lesser extent on such practical
considerations as the difficulty of scoring.


3 J. S. Ahmann and M. D. Glock, Evaluating Pupil Growth (Boston: Allyn and Bacon,
1963).

4 J. S. Ahmann and M. D. Glock, Evaluating Pupil Growth (Boston: Allyn and Bacon,
1963).

5 R. L. Ebel, “Writing the Test Item,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).


COMPARATIVE ADVANTAGES OF OBJECTIVE AND ESSAY TESTS 


From our previous discussion, it is apparent that both the objective test and
the essay test can provide valuable evidence concerning pupil achievement. Each
has its unique advantages and limitations which make it more appropriate for
some purposes than for others. A comparison of the relative merits of tests
based on these two item types, with regard to a number of important charac-
teristics, is presented in Table 6.1.


Table 6.1

COMPARATIVE ADVANTAGES OF OBJECTIVE AND ESSAY TESTS


Learning Outcomes Measured

    Objective Test: Efficient for measuring knowledge of facts. Some types
    (e.g., multiple choice) can also measure understandings, thinking skills,
    and other complex outcomes. Inefficient or inappropriate for measuring
    ability to select and organize ideas, writing abilities, and some types
    of problem-solving skills.

    Essay Test: Inefficient for measuring knowledge of facts. Can measure
    understandings, thinking skills and other complex learning outcomes
    (especially useful where originality of response is desired). Appropriate
    for measuring ability to select and organize ideas, writing abilities,
    and problem-solving skills requiring originality.

Preparation of Questions

    Objective Test: A relatively large number of questions needed for a test.
    Preparation is difficult and time consuming.

    Essay Test: Only a few questions are needed for a test. Preparation is
    relatively easy (but more difficult than generally assumed).

Sampling of Course Content

    Objective Test: Provides an extensive sampling of course content, due to
    the large number of questions that can be included in a test.

    Essay Test: Sampling of course content is usually limited, due to the
    small number of questions that can be included in a test.

Control of Pupil Response

    Objective Test: Complete structuring of task limits pupil to type of
    response called for. Prevents bluffing and avoids influence of writing
    skill. However, selection-type items are subject to guessing.

    Essay Test: Freedom to respond in own words enables bluffing and writing
    skill to influence the score. However, guessing is minimized.

Scoring

    Objective Test: Objective scoring which is quick, easy, and consistent.

    Essay Test: Subjective scoring which is slow, difficult, and inconsistent.

Influence on Learning

    Objective Test: Usually encourages pupils to develop a comprehensive
    knowledge of specific facts and the ability to make fine discriminations
    among them. Can encourage the development of understandings, thinking
    skills and other complex outcomes, if properly constructed.

    Essay Test: Encourages pupils to concentrate on larger units of subject
    matter, with special emphasis on the ability to organize, integrate, and
    express ideas effectively.


In considering the comparative advantages of these two main item types as a
basis for choosing between them, we should be careful not to fall into the “either-
or” type of thinking. That is, we need not use either objective questions or essay
questions exclusively. It is frequently more valid to use both types in a single
test, with each measuring those learning outcomes for which it is best suited. This
should also have a desirable influence on the pupil’s learning, since in prepar-
ing for such tests he has to devote attention both to specific facts and to their
integration into more general understandings.


GENERAL PRINCIPLES OF TEST CONSTRUCTION 


In the process of constructing classroom tests, we frequently focus our atten-
tion so narrowly on the detailed procedures for constructing specific types of
test items that we lose sight of our major purpose, which is to develop a valid
instrument for evaluating pupil achievement. This tendency can be offset by
careful preliminary planning and by the use of a general frame of reference
within which to view the specific procedures of test construction. The following
principles of classroom testing provide such a frame of reference.

1. Testing procedures must take into account the use to be served
by the test results. If we are interested in determining a pupil’s readiness to start a new
unit of work or a new course, our test can be confined to a limited area of
knowledge or skill. For example, a pretest in fourth-grade science might be
concerned solely with knowledge of science terms, a pretest in ninth-grade
algebra might be limited to computational skill in arithmetic, and a pretest in
beginning French might be confined to the measurement of knowledge of Eng-
lish grammar and usage. In addition to being concerned with a limited aspect
of some subject-matter area, the pretest also tends to have a relatively low level
of difficulty. This is due to the fact that when we use a pretest we are usually
trying to determine if the pupils have the minimum background necessary to
proceed with the course or unit of work.

Another type of test which is concerned with minimum essentials is the mas-
tery test. The mastery test is given at the end of a unit or course to determine the extent
to which pupils have mastered knowledges or skills considered to be funda-
mental to future learning. As with the pretest, the mastery test is usually limited
in scope and of a relatively low level of difficulty.

When our purpose is to diagnose the learning difficulties of pupils, our
emphasis will be on how the pupils respond to specific problems or situations.
We will need to include a number of test items in each area with some
slight variations from item to item. In diagnosing pupils’ difficulties in adding
whole numbers, for example, we would want to include several addition prob-
lems that do not require carrying and several that do require carrying to deter-
mine if this is a possible source of difficulty. Since our focus is on the
pupils’ learning difficulties, our test must be constructed in light of the most
common sources of error encountered by pupils. The difficulty of the test items
and the total score on the test are relatively unimportant in constructing diag-
nostic tests. Here we are concerned with part scores and responses to individual
test items.


When our test results are to be used to evaluate pupil progress toward instruc-
tional objectives, we are interested in constructing a general achievement test.
Such a test enables us to rank pupils in order of their achievement and to iden-
tify general areas of weakness. For such purposes, we want a test that measures
a representative sample of the course objectives and course content, that is
difficult enough to provide a reliable ranking of pupils, and that contributes to
improved teaching and learning.

Most tests constructed by classroom teachers are of the general achievement type.
Consequently, the principles and suggestions in this chapter pertain most directly
to the general achievement test. In applying them to pretests, mastery tests, and
diagnostic tests, it is necessary to take into consideration that these test types
are more limited in scope, generally have lower difficulty, and are less concerned
with the relative achievement of pupils.

2. The types of test items used should be determined by the specific learning
outcomes to be measured. In constructing a classroom test one of our major
concerns is that the test items [...]
terms, state facts, apply knowledge [...]
indicated in the specific learning [...]
objectives. This is necessary [...]
items as evidence that the [...]
the nature of the test items selected should depend chiefly on the
nature of the outcomes to be measured.

Each type of test item is efficient for measuring some learning outcomes and
inefficient, or inappropriate, for measuring others. The short-answer type, for
example, effectively measures the recall of specific facts but is generally inappro-
priate for measuring understanding, [...] complex learning outcomes. The
true-false or alternative-response type of item is most useful where [...]
either determining the truth or falsity of a statement, distinguishing between
fact and opinion, or discriminating between appropriate and inappropriate
responses. As with the short-answer type, the true-false item is generally
inadequate for measuring the more complex learning outcomes. The matching
type of item is similarly restricted. It is limited to learning outcomes that call
for the identification of simple [...] to classify things into given categories. The
multiple-choice item is the most generally adaptable type of test item. It can be
used effectively in measuring a variety [...]
measuring the ability to organize data, for measuring the ability to present
original ideas, and for some types of problem-solving activity.

Whether a test item actually measures the particular behavior called for in a 
specific learning outcome depends, of course, to a large extent on the skill with 
which the test item is constructed. No amount of skill, however, will enable us 
to develop a valid test of achievement if the test items selected for use are inap- 
propriate to the specific learning outcomes to be measured. 

3. Test items should be based on a representative sample of the course con-
tent and the specific learning outcomes to be measured. A test, no matter how
extensive, is always a limited sample of the many possible test items that could
be included. For example, we expect pupils to know thousands of specific facts
but we can test for only a limited number of them; we expect pupils to develop
understandings which are applicable to innumerable situations but we can test
for application to only a limited number of situations; and we expect pupils to
develop thinking skills which will enable them to solve a variety of problems
but we can test their problem-solving ability with only a limited number of
problems. In each area of content and for each specific learning outcome, then,
we merely select a sample of pupil behavior and accept it as evidence of achieve-
ment in that area. We assume that the pupil’s responses to our selected set of
test items are typical of what his responses would be like to other test items
drawn from the same area. This means, of course, that our limited samples must
be selected in a way to provide a representative sample in each of the various
areas for which the test is being developed.

As noted earlier, some types of course content and learning outcomes are
more easily patterned into test questions than others. It is much easier to con-
struct test items which measure knowledge of specific facts and computational
skill, for example, than to develop measures of understanding, reasoning ability
and thinking skills. If we are to construct a test which measures a representative
sample of pupil behavior, we must resist the temptation to concentrate on those
areas where test items are easily constructed. Following our table of specifica-
tions as closely as possible seems to provide the best assurance that our test
will measure the various areas of course content and learning outcomes in a
balanced manner.

Another important consideration in obtaining a representative sample of
pupil behavior is the minimum number of test items to be included
in an area. This is especially important in measuring complex learning
outcomes where there is a tendency to use fewer items. A single test item
for the interpretation of graphs, for instance, is inadequate to measure the
pupil’s ability to interpret graphs. The nature of the data in the particular graph
may be the most influential factor in determining whether he answers correctly
a single test item. Where several, or more, items are used, the influence of
such specific factors is minimized and we obtain a more representative sample
of the pupil’s ability to interpret graphs. How many test items are needed to
obtain a representative sample of behavior in any given area is difficult to
specify. Other things being equal, the greater the number of test items the
more adequate the sample and the more reliable the results. Thus, as a general rule,
we probably should strive for at least several test items in each area and more,
if feasible. Although the total length of the test and the time available for
administering the test will restrict the number of items we can include in any
given area, our major aim should be to obtain as representative a sample of
pupil behavior as possible.
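The link between the number of items and the reliability of the results can be illustrated with the general Spearman-Brown formula, a standard result which is brought in here only as an aid. It assumes that the added items are comparable to those already in the test, which real items only approximate. The function name and the starting reliability of .50 are hypothetical.

```python
def lengthened_reliability(reliability, length_factor):
    """Spearman-Brown estimate of reliability after the test is made
    length_factor times as long with comparable items."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical short test with reliability .50, lengthened 2, 3, and 4 times.
for factor in (1, 2, 3, 4):
    print(factor, round(lengthened_reliability(0.50, factor), 2))
```

The gains shrink as the test grows longer, which fits the practical advice above: aim for at least several items in each area, and add more where time permits.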

j + Test items should be of the proper level of difficulty. In measuring k 
tent to which pupils are achieving our course objectives, we have no abso u 
standard by which to determine their progress. A pupil’s achievement aR : 
regarded as high or low only by comparing it with the achievement of ot pe 
pupils. Consequently, the major purpose of a general achievement test is 
provide a reliable ranking of pupils in order of their achievement. We ae 
like the pupil who has achieved our course objectives to the highest degree x 
have the highest test score and to have the other pupils ranked below him 1r 
accordance with the degree to which they have achieved the course ajo 
In addition, we would like the differences among test scores to be large enough 
to assure us that our test is measuring real differences in achievement. 


Maximum differentiation among pupils in terms of achievement is obtained
when the average score is approximately [...] the scores range from “near
[...] example, an average [...] ideal. This [...]
effectively by striving to construct test items which
are at the 50 per cent level of difficulty, that is, items which are so difficult
that only approximately half of the pupils will answer them correctly.
[...] will be easier than the 50 per cent
[...] the ideal toward which we should
[...] of each test item. Except for a few items at the begin-
ning [...] motivational purposes, none of our items should be so easy
[...] terms of their achievement.
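The level of difficulty of each item can be checked after a test has been given. The sketch below assumes items scored 1 (right) or 0 (wrong) and expresses difficulty as the per cent of pupils answering the item correctly; the function name and the pupil data are hypothetical.

```python
def item_difficulty(item_scores):
    """Per cent of pupils answering each item correctly.

    item_scores holds one row per pupil and one 1/0 entry per item.
    A lower percentage means a more difficult item.
    """
    n_pupils = len(item_scores)
    n_items = len(item_scores[0])
    return [
        100.0 * sum(row[j] for row in item_scores) / n_pupils
        for j in range(n_items)
    ]


# Hypothetical results: five pupils (rows) on a four-item test (columns).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for number, percent in enumerate(item_difficulty(scores), start=1):
    print(f"Item {number}: {percent:.0f} per cent correct")
```

Items answered correctly by nearly all pupils, or by almost none, contribute little to the ranking of pupils and are natural candidates for review.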


In attempting to construct test items of a higher level of difficulty we should
make a special effort to avoid resorting to undesirable methods for obtaining
difficulty. It is not uncommon, for example, to use more obscure, less important
factual information to increase the difficulty of test items. This generally leads
to a lessening of content validity and may also be undesirable from a learning
standpoint. Pupils are apt to concentrate their efforts on the learning of the
less important material and to neglect the more important learning outcomes.
A closely related method of achieving difficulty at the expense of validity is
to require pupils to make difficult but unimportant discriminations. In the fol-
lowing test items, for example, note how the significance of the information
decreases as the difficulty increases.


The government of the United States was declared in effect under the Constitution in
A 1787.
B 1788.
(C) 1789.
D 1790.

The government of the United States was declared in effect under the Constitution in
A January.
B February.
(C) March.
D April.

The government of the United States was declared in effect under the Constitution on
A Monday.
B Tuesday.
(C) Wednesday.
D Thursday.


Asking pupils to make fine discriminations, of course, does not always lead
to learning outcomes of less significance. However, we need to be on guard
against such dangers when increasing the difficulty of our test items. Other
things being equal, the best way to increase the difficulty of test items is to move
toward the measurement of learning outcomes of a more complex nature, such
as knowledge of principles, understanding of concepts, application of principles
and concepts, and the like.
Although the difficulty of test items is of major concern in achievement test-
ing, it is of little significance in other classroom tests. As we noted earlier, tests
which are used for measuring mastery of minimum essentials and for diagnosing
learning difficulties generally have a relatively low level of difficulty. With such
tests, we are mainly concerned with each pupil’s individual responses to the
test items so that we may better analyze his shortcomings and apply appropriate
remedial procedures. Since his total test score and his relative standing in the
group are of little or no consequence, the difficulty of the items is of minor
importance. The nature of the course content to be mastered and the area in
which learning difficulties are to be diagnosed are the important factors. The
difficulty of the test items results directly from such considerations rather than
from some predetermined level of difficulty considered to be desirable.

5. Test items should be so constructed that extraneous factors do not prevent
the pupil from responding. When a pupil has achieved a particular learning
outcome (e.g., knowledge of basic principles), we want him to obtain correct
answers to those test items which measure the attainment of that learning out-
come. We would be very unhappy (and so would he), if he answered such test
items incorrectly merely because the sentence structure was too complex, the
vocabulary too difficult, or the type of response called for too vague. These
factors, which are extraneous to the central purpose of the measurement, limit
and modify the pupil’s responses and prevent him from showing the true level
of achievement he has attained. Such factors are as unfair as determining a
person’s running ability when he has a sprained ankle. Although a measure
of running ability would be obtained, the performance would be restricted by a
factor we did not intend to include in our measurement.

One way to eliminate factors which are extraneous to the purpose of measure- 
ment is to be certain that all pupils have the prerequisite skills and abilities 
needed to make the response. These have been called “enabling behaviors”
because they enable the pupil to make the response but are not meant to be 
critical factors in the measurement.6 That is, they are a necessary but not a 
sufficient condition for responding correctly. Probably the most important 
“enabling behavior” in objective testing is reading skill. In essay testing, skill 
in written expression is also a factor to be considered. In measuring
understandings, thinking skills and other complex learning outcomes, knowledge of
certain specific facts and simple computational skills might also be necessary
prerequisites.


In constructing test items, then, we need to strive for items which measure
achievement of the specific learning outcomes and not differences in “enabling
behaviors.” Differences in reading ability, computational skill, communication
skills, and the like, should not influence the pupils’ responses unless such
outcomes are specifically being measured.7 The only functional difference between
those pupils who get an item correct and those who miss it should be the pos- 
session of the knowledge, understanding, or other learning outcome being 
measured by the item. All other differences are extraneous to the purpose of 
the item and their influence should be eliminated or controlled for valid test 
results. 

A special problem in preventing extraneous factors from distorting our test 
results is that of avoiding ambiguity. Objective test items are especially sub- 
ject to misinterpretation where long complex sentences are used, where the 
vocabulary is unnecessarily difficult, and where words which lack precise meaning 
are used. Thus, the antidote for ambiguity seems to be a careful choice of words,
from the viewpoint of both level of reading difficulty and preciseness of
meaning, and the use of brief, concise sentences.

6. Test items should be so constructed that the pupil obtains the correct 
answer only if he has attained the desired learning outcome. This is the coun- 
terpart of the preceding principle. In that one, we were concerned with those 
factors which prevent a pupil from responding correctly even though he has 
attained the desired learning outcome. Here, we are concerned with those factors 
which make it possible for the pupil to respond correctly even though he lacks 
the necessary achievement. These are the clues, some rather obvious and some 
very subtle, which inadvertently creep into test items during their construction. 
They lead the nonachiever to the correct answer and thereby prevent the items
from functioning as intended. When test items are short-circuited in this
manner, they of course provide invalid evidence concerning achievement.

6 E. J. Furst, Constructing Evaluation Instruments (New York: Longmans, Green, 1958).
7 One reason the oral examination has been largely discarded as a technique for measuring
achievement is the fact that the results are too dependent upon skill in oral expression.

The most obvious clues in test items are probably those due to grammatical 
structure. In the following item note how the article an provides a direct clue to 
the answer.

A porpoise is an
A plant.
B reptile.
C animal.
D bird.


Such clues are not limited to selection-type items. They also appear in
supply-type items, as indicated in the following illustrative item.

A piece of land that is completely surrounded by water is known as an ________.


The clue is much less obvious, to the person constructing the test item, than
it was in our first illustration. To the pupil taking the test, however, it is
readily apparent. The two most plausible answers are “island” and “peninsula.”
Since peninsula begins with a consonant sound and the blank is preceded by
an, it is ruled out as a possibility. This does not imply, of course, that pupils
need to know the rules for good grammatical structure in order to use such
clues. Most clues are not analyzed and evaluated, as above. Rather, they are
responded to in terms of partial knowledge and hunches. “An peninsula” just
does not sound right to the pupil so he responds with the word “island” and
obtains the correct answer.
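
The grammatical clue just described can be checked mechanically. The following is only a rough sketch in Python, not a procedure from this book: the function names are hypothetical, and the test of "vowel sound" is approximated by the first letter of the word, which is an assumption that fails for words such as "hour" or "unit."

```python
VOWELS = "aeiou"

def fits_article(article, word):
    """Rough check: 'an' goes before a vowel letter, 'a' before a consonant letter."""
    starts_with_vowel = word.strip().lower()[0] in VOWELS
    return starts_with_vowel if article == "an" else not starts_with_vowel

def options_left_by_article(stem, options):
    """Return the options a pupil could still choose using grammar alone."""
    last_word = stem.strip().split()[-1].lower()
    if last_word not in ("a", "an"):
        return list(options)  # no article at the end, so no grammatical clue
    return [opt for opt in options if fits_article(last_word, opt)]

porpoise = options_left_by_article(
    "A porpoise is an", ["plant", "reptile", "animal", "bird"])
island = options_left_by_article(
    "A piece of land that is completely surrounded by water is known as an",
    ["island", "peninsula"])
print(porpoise)  # ['animal']
print(island)    # ['island']
```

When only one option survives the check, the item can be answered from grammar alone. Ending the stem with "a(n)" leaves every option in play.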


Leads to the correct answer may also be provided by verbal associations.
Note how the word “wind,” in the following item, provides a clue to the
answer.

Which one of the following instruments is used to determine the direction of the wind?
A Anemometer.
B Barometer.
C Hygrometer.
D Wind vane.


Rather than lead the uninformed to the correct answer, such clues should
lead the nonachiever away from the correct answer. In the following item, the
same clue makes “wind vane” a plausible (but incorrect) answer for those
who have not learned the uses of the various weather instruments.

Which one of the following instruments is used to determine the speed of the wind?
A Anemometer.
B Barometer.
C Hygrometer.
D Wind vane.


Verbal clues need not be as obvious as those illustrated above. In fact, the
clues which appear in the final version of a test are usually rather subtle. They
are based on partial knowledge and clang associations not readily apparent to
the casual observer. For example, at first glance the following item appears to
be free from clues.


Which one of the following is used to prevent polio? 
A Gamma globulin. 
B Penicillin. 
C Salk Vaccine.
D Sulpha. 


A careful examination of this item, however, will indicate that the word 
“vaccine” provides a clue to the answer. All the pupil needs to know to answer
the item correctly is that “vaccine” is used to “prevent” disease. Since most
pupils have been vaccinated at one time or another, they probably possess this 
partial knowledge needed to make the clue apparent to them. Some pupils may 
also have developed a clang association between “Salk” and “polio” and re- 
spond correctly on that basis. In either case, partial knowledge can lead to 
the correct answer and prevent the item from functioning as intended. 

Another type of subtle clue is that based on the words used to qualify
statements. For example, true-false statements that include qualifiers such as
“sometimes,” “usually,” “generally,” and the like are most frequently true, while
statements containing absolutes such as “always,” “never,” “none,” “only,” are
most frequently false. Such words have been called specific determiners. They
are difficult to remove from true-false items because true statements generally
must be qualified and false statements frequently must be stated in terms of
absolutes to make them clearly false.

Clues, which prevent test items from functioning as intended, can usually
be eliminated during test construction. In fact, many of the suggestions for
constructing each of the specific types of test items are aimed directly at
avoiding such clues. It is also helpful to analyze each completed test item in terms of
the mental process a pupil must use to obtain the correct answer and to
compare this with the intended purpose of the item. Only when these two are in
harmony, can we be fairly certain that irrelevant factors are not operating and
that correct answers to the test items indicate attainment of the desired learning
outcomes.
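
One part of this review, the search for specific determiners, lends itself to a simple mechanical check. The sketch below is an illustration only: the word lists contain just the qualifiers and absolutes named in the text, and both the function name and the sample statement are hypothetical.

```python
import re

# Only the words named in the text; a teacher would likely extend both sets.
QUALIFIERS = {"sometimes", "usually", "generally"}
ABSOLUTES = {"always", "never", "none", "only"}

def specific_determiners(statement):
    """Return the qualifiers and absolutes found in a true-false statement."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    return {
        "qualifiers": sorted(words & QUALIFIERS),
        "absolutes": sorted(words & ABSOLUTES),
    }

# Hypothetical true-false statement used only to exercise the check.
flags = specific_determiners("Metals always expand when they are heated.")
print(flags)
```

A flagged statement is not necessarily defective. The check merely points to items in which a pupil might respond to the wording rather than to the content.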


7. The test should be so constructed that it contributes to improved teaching- 
learning practices. The ultimate purpose of testing, as with all classroom pro- 
cedures, is to improve pupil learning. Thus, any classroom test we construct 
should be evaluated in terms of the extent to which it contributes, directly or
indirectly, toward this end. A well-constructed classroom test should increase
both the quantity and quality


ing procedures, and 

One way of assur 
is to pay particular 
measured by our tes 
of the areas covere 


learning outcomes, the pupil soon learns that a mass of memorized
facts is not sufficient. He must also learn to interpret and apply
facts, develop conceptual understandings, draw conclusions, recognize
assumptions, identify cause-and-effect relations, and the like. This discourages the
pupil from placing sole dependence on memorization as a basis for learning
and encourages him to develop the use of more complex mental processes.

The practice of constructing tests which measure a variety of learning
outcomes should also lead to improved teaching procedures. As we translate the
general learning outcomes into test items, we develop a better notion of the
mental processes involved. Thus, the functional nature of understandings,
thinking skills and other complex learning outcomes become increasingly clear to
us. This clarification of how achievement is reflected in terms of mental
processes enables us to plan the learning experiences of pupils more effectively.
Furthermore, we also are more apt to emphasize understandings, thinking skills and
other complex learning outcomes in our teaching when we include them in our
testing. This may seem a case of the cart pulling the horse, but a well-constructed
test frequently leads to a review of teaching procedures and to the abandonment
of those which encourage rote memorization.

A test will contribute to improved teacher-pupil relations if pupils view the
test as a fair and useful measure of their achievement. We can make fairness
apparent by including a representative sample of the subject-matter content and
learning outcomes we have emphasized during instruction, by writing clear
directions, by making certain that the intent of each test item is clear and that it
can be answered correctly by any pupil who has achieved the desired outcome, and by
providing adequate time limits for the test. Pupil recognition of usefulness
depends as much on what we do with the results of the test as on the
characteristics of the test itself. We make the usefulness apparent by using the
results as a basis for guiding and improving learning.


SUMMARY

Classroom tests provide information on the extent to which pupils are
achieving the learning outcomes considered important by the teacher. A
representative sample of pupil behavior in each of the areas being tested is more
apt to result if a table of specifications is used in planning for the test.

Tests constructed by classroom teachers may be classified as objective tests
or essay tests. These may be further subdivided into the following basic types
of test items:

Objective test:
A. Supply type
1. Short answer
2. Completion
B. Selection type
1. True-false or alternative-response
2. Matching
3. Multiple choice

Essay test: 
A. Extended response 
B. Restricted response 


The objective test provides the pupil with a highly structured task which 
limits the pupil’s response to supplying a word, number, or symbol, or to 
selecting the answer from among a given number of alternatives. The essay 
test permits the pupil to respond by selecting, organizing, and presenting those 
facts he considers appropriate. Both types of tests serve useful purposes in 
measuring pupil achievement. The type to use in a particular situation is best 
determined by the learning outcomes to be measured, and by the unique
advantages and limitations of each type. A common practice is to include both
objective test items and essay questions in a single test.

If classroom tests are to provide valid information concerning pupil achieve- 
ment, their preparation must be guided by sound principles of test construction. 
In brief, these principles emphasize the importance of (1) considering the pur- 
pose of the test, (2) selecting the type of test item which best measures the 
learning outcome, (3) obtaining a representative sample of pupil behavior, 
(4) constructing test items of the proper level of difficulty, (5) eliminating 
extraneous factors which prevent the pupil from responding, (6) eliminating 
clues leading to the correct answer, and (7) making a test which contributes to
improved teaching-learning practices.


Chapter 7 
constructing 
objective 
test items: 
simple forms 


Each type of test item has its own unique characteristics . . . uses . . .
advantages . . . limitations . . . and rules for construction. . . . These
are considered for the objective-test forms which measure relatively
simple learning outcomes: (1) the short-answer item, (2) the true-false item,
and (3) the matching exercise.


The preliminary planning for a classroom test, discussed in the last chapter,
provides a sound basis for developing a valid test of pupil achievement. The
table of specifications clarifies the subject-matter content and learning outcomes
to be measured, and the general principles of test construction provide a
framework within which to proceed. The next step is the actual construction of test
items. This step is crucial, since in final analysis the validity of an achievement
test is determined by the extent to which the behaviors to be measured are
actually called forth by the test items. Selecting test items that are inappropriate
for the learning outcomes to be measured, constructing items with technical
defects, or unwittingly including irrelevant clues in the items can undermine
all of the careful planning that has gone on before.

The construction of good test items is an art. The skills it requires, however, 
are the same as those found in effective teaching. It requires a thorough grasp 
of subject matter, a clear conception of desired learning outcomes, a
psychological understanding of pupils, sound judgment, persistence, and a touch of
creativity. The only additional requisite for constructing good test items is the
skillful application of a wide array of simple but important rules and
suggestions.1 These techniques of test construction are the topic of this and the next
several chapters.

In this chapter we shall limit our discussion to the more simple forms of
objective test items, namely, the (1) short-answer item, (2) true-false or
alternative-response item, and (3) matching exercise. These item types are treated
together since their use in classroom testing is restricted, almost exclusively,
to the measurement of simple learning outcomes in the knowledge area. The
multiple-choice item and other methods of measuring more complex
achievement will be considered in the following chapters.


SHORT-ANSWER ITEMS 


The short-answer item and completion item are both supply-type test items
which can be answered by a word, phrase, number, or symbol. They are
essentially the same, differing only in the method of presenting the problem. In the
case of the short-answer item a direct question is used, while the completion item
consists of an incomplete statement.


EXAMPLES

Short answer: What is the name of the man who invented the steamboat? (Robert Fulton)
Completion: The name of the man who invented the steamboat is (Robert Fulton).
(or)
The (steamboat) was invented by Robert Fulton.


Also included in this category are problems in arithmetic, mathematics,
science, and other areas, where the solution must be supplied by the pupil.


Uses of Short-Answer Items

The short-answer type of test item is suitable for measuring a wide variety
of relatively simple learning outcomes. The following outcomes and test items
illustrate some of the common uses of this type of item.


EXAMPLES


Knowledge of terminology
Lines on a weather map which join points of the same barometric pressure are called
(isobars).

Knowledge of specific facts
A member of the United States Senate is elected to a term of (6) years.

Knowledge of principles
If the temperature of a gas is held constant while the pressure applied to it is increased,
what will happen to its volume? (It will decrease)

Knowledge of method or procedure
What device is used to detect whether an electric charge is positive or negative?
(electroscope)

Simple interpretations of data
How many syllables are there in the word Argentina? (4)
In the number 612, what value does the 6 represent? (600)
In the triangle below, what is the number of degrees in each angle? (60)
[Figure: a triangle with each side marked 4”]
If an airplane flying northwest made a 180 degree turn, what direction would it be heading?
(southeast)


1 J. S. Ahmann and M. D. Glock, Evaluating Pupil Growth (Boston: Allyn and Bacon,
1963). H. H. Remmers, N. L. Gage, and J. F. Rummel, A Practical Introduction to
Measurement and Evaluation (New York: Harper & Row, 1960). R. M. Thomas, Judging Student
Progress (New York: Longmans, Green, 1960). R. M. W. Travers, How to Make Achievement
Tests (New York: Odyssey Press, 1950).


Interpretations of a more complex nature are obtained where the short-answer 
item is used to measure the ability to interpret diagrams, charts, graphs, and 
various types of pictorial data. 

There are even more notable exceptions to the general rule that short-answer 
items are limited to measuring simple learning outcomes. These are in the areas
of mathematics and science where the solutions to problems can be indicated
by numbers or symbols. The following examples illustrate this use. 


EXAMPLES 


Ability to solve numerical problems 
Milk sells for $.26 a quart and $.88 a gallon. How many cents would you save on each
quart of milk, if you bought it by the gallon? (4)


Skill in manipulating mathematical symbols 


tee 3 (3b) 
b. bel + then x = (b —). 


Ability to complete and balance chemical equations 
Mg + (2) HCl → (MgCl₂ + H₂)
(2) Al + (6) HCl → (2 AlCl₃ + 3 H₂)


For outcomes similar to those indicated in these last examples, there is no 
adequate substitute for the short-answer item. The behavior described in the 
learning outcomes is identical to the behavior called forth by the test items. 
To obtain correct answers, pupils must actually solve problems, manipulate 
mathematical symbols, and complete and balance equations.

Attempts are sometimes made to measure such problem-solving activities 
with selection-type test items. This commonly results in test items which do not 
function as intended or which measure quite different learning outcomes. In 
the following multiple-choice items, for example, note how the division problem 
can be solved by working it backwards (multiplying 2 X 43, or merely 2 X 3) ? 
and how in the second problem the value of x can be determined by substituting 
each of the alternative answers in the equation on a trial-and-error basis. Such 


problems obviously do not call forth the problem-solving behavior we are 
attempting to measure. 


Objective Test Items: Simple Forms 123 


EXAMPLES 


£ 
If T + 76 = 10, then x equals 
A 16 


B 24 


© 32 


° D 48 
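
The trial-and-error strategy described above is easy to sketch in Python. The equation used here is a hypothetical one chosen for illustration, not the item printed in the example, and the function name is likewise an assumption. The point is only that each alternative can be tested in turn without solving for x.

```python
def answer_by_substitution(equation_holds, options):
    """Return every option that satisfies the equation, with no algebra at all."""
    return [value for value in options if equation_holds(value)]

# Hypothetical item: "If 3x - 12 = 60, then x equals  A 16  B 24  C 32  D 48"
options = [16, 24, 32, 48]
found = answer_by_substitution(lambda x: 3 * x - 12 == 60, options)
print(found)  # [24]
```

A pupil who works this way demonstrates skill in arithmetic checking rather than skill in solving equations, which is the learning outcome the item was meant to measure.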
Similar difficulties are encountered when we substitute selection items
measuring the ability to “recognize balanced chemical equations” for short-answer
items measuring the ability to “complete and balance chemical equations.” The
former task is a simple one requiring little more than a knowledge of
arithmetic, while the latter one requires rather extensive knowledge of chemical
reactions and their resulting products.

In summary, if the short-answer test item is most effective for measuring a
specific learning outcome, it should be used. We should not discard it for items
of the selection type unless we are fairly certain that the same learning
outcomes will be measured. For many of the simpler learning outcomes, such as
knowledge of factual information, changing to some form of selection item will
not decrease the validity of the measurement and will result in increased
objectivity and ease of scoring. For some of the more complex learning outcomes
such as those in mathematics and science, however, discarding the short-answer test
item may mean a change in the learning outcomes being measured. In deciding
whether to use short-answer items or some other item type, our best guide is
the following general principle. Each learning outcome should be measured as
directly as possible and the test item type which is most appropriate for that
purpose should be selected for use.

Advantages and Limitations of Short-Answer Items


The short-answer test item is one of the easiest to construct. This is partly due
to the simple learning outcomes usually measured with this type of
item. Except for the problem-solving outcomes measured in mathematics and
science, the short-answer item is used almost exclusively to measure the recall
of information.

A more important advantage of the short-answer item arises out of the fact
that the pupil must supply the answer. This reduces the possibility that the
pupil will get the correct answer by guessing. He must either recall the
information requested or make the necessary computations to solve the problem
presented to him. Partial knowledge, which might enable him to choose the correct
answer on a selection item, is insufficient for answering a short-answer test item
correctly.


There are two major limitations which restrict the use of the short-answer 
test item. One—their unsuitability for measuring complex learning outcomes— 
has already been mentioned. The other has to do with the difficulty of scoring.
Unless the question is very carefully phrased, a variety of answers of varying
degrees of correctness must be considered for total or partial credit. For
example, a question such as “Where was George Washington born?” could result in
the name of the city, county, state, region, or continent. Although the teacher
had the name of the state in mind when he wrote the question, he could not
dismiss the other answers as incorrect. Even when this problem is avoided, the
scoring is contaminated by the pupil's spelling ability. If full or partial credit
is taken off for misspelled words, the pupils’ test scores reflect varying degrees
of knowledge and spelling skill. If spelling is not counted in the scoring, the
teacher must still decide whether misspelled words actually represent the
correct answer. We are all familiar with misspellings that are so bad that it is
difficult to determine what the pupil had in mind. These complications make
scoring more time consuming and less objective than that obtained with
selection-type items.

The limitations discussed above are less troublesome where the answer is to
be expressed in numbers or symbols, as in problem solving in physical science
and mathematics. Here, more complex learning outcomes can be measured,
spelling is not a problem, and it is usually easier to write test items to which
there is only one correct response.


Suggestions for Constructing Short-Answer Items 


The short-answer item is subject to a variety of defects, despite the fact that
it is considered one of the easiest to construct. The following suggestions will
aid in avoiding possible pitfalls and will provide greater assurance that the
items will function as intended.


1. Word the item so that the required answer is both brief and definite. As
indicated earlier the answer should be a word, phrase, number, or symbol. This
can be easily conveyed to the pupils through the directions at the beginning
of the test and by proper phrasing of the question. More difficult is the stating
of the question so that only one answer is correct.


EXAMPLES

Poor: An animal that eats the flesh of other animals is (carnivorous).
Better: An animal that eats the flesh of other animals is classified as (carnivorous).


The first version of this test item is so indefinite that it could be completed 
with answers such as “the wolf,” “a meat eater,” or even “hungry.” Asking the 
pupils to classify this type of animal, as called for in the improved version, 
provides a more definite structure to the problem and makes clear the type of 
response required. 
2. Do not take statements directly from textbooks to use as a basis for
short-answer items. When taken out of context, textbook statements are frequently
too general and ambiguous to serve as good short-answer items. Note the
vagueness of the first version of the following test item, which was taken verbatim
from a chemistry textbook.


EXAMPLES

Poor: Chlorine is a (halogen).
Better: Chlorine belongs to a group of elements that combine with metals to form salt. It is
therefore called a (halogen).


Pupils would most likely respond to the first version of this test item with
the word “gas,” since that is the natural state of chlorine and there is nothing
in the statement to imply that the word “halogen” is wanted. The only pupils
who would be apt to supply the intended answer would be those who had
memorized textbook statements. The revised version measures an important
understanding and does not depend on the specific phraseology of the textbook.
Such items tend to discourage the pupils from developing little-understood
verbal associations based on textbook language and encourage them to
achieve the learning outcomes being measured.

3. A direct question is generally more desirable than an incomplete
statement. There are two advantages to the direct question form. First, it is more
natural to the pupils, since this is the usual method of phrasing questions in
classroom discussions. This is especially important to elementary pupils
first exposed to short-answer tests. Second, the direct question provides
structure to the situation and prevents much of the ambiguity that creeps
into items based on incomplete statements. Just the phrasing of a question
seems to require us to clarify what it is we precisely want to know.

EXAMPLES

Poor: John Glenn made his first orbital flight around the earth in (1962).
Better: When did John Glenn make his first orbital flight around the earth? (1962)
Best: In what year did John Glenn make his first orbital flight around the earth? (1962)


The first version of the item could, of course, be completed with “a space
capsule,” “Friendship Seven,” “space,” and similar answers. Putting it in
question form forces us to indicate whether it is the time, place, or method we are
interested in knowing. The last version is merely a refinement which makes the
question even more specific and which naturally evolves from a consideration
of the “when” aspect of the previous question.

4. Where the answer is to be expressed in numerical units, indicate the type
of answer wanted. For computational problems, it is usually preferable to
indicate the units in which the answer is to be expressed. This will clarify the
problem to the pupil and will simplify the scoring.


EXAMPLES

Poor: If oranges weigh 5⅔ oz. each, how much would a dozen oranges weigh?
Answer: (4 lb. 4 oz.)
Better: If oranges weigh 5⅔ oz. each, how much would a dozen oranges weigh?
Answer: (4) lb. (4) oz.



Unless the type of unit is specified, as in the revised version, correct answers
will include 68 oz., 4¼ lbs., 4.25 lbs., and 4 lbs. and 4 oz. This adds unnecessary
confusion to the scoring.
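
The equivalence of these answers can be verified with a short calculation. The sketch below assumes each orange weighs 5⅔ oz., the figure implied by the 68-oz. total; the function names are hypothetical. Exact fractions are used so that no rounding enters the result.

```python
from fractions import Fraction

OUNCES_PER_POUND = 16

def total_weight_oz(weight_each_oz, count):
    """Total weight in ounces of `count` items of equal weight."""
    return weight_each_oz * count

def as_pounds_and_ounces(total_oz):
    """Split a weight in ounces into whole pounds and leftover ounces."""
    pounds, ounces = divmod(total_oz, OUNCES_PER_POUND)
    return int(pounds), ounces

total = total_weight_oz(Fraction(17, 3), 12)   # 5 2/3 oz. each, one dozen
pounds, ounces = as_pounds_and_ounces(total)
print(total)                              # 68
print(pounds, ounces)                     # 4 4
print(float(total / OUNCES_PER_POUND))    # 4.25
```

All three forms describe the same weight, which is why a scorer must accept each of them unless the item states the units wanted.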

Where problems do not come out even, it is also usually desirable to indicate 
the degree of precision expected in the answers. For example, specifying that 
the answers should be “carried out to two decimal places” or “rounded off to 
the nearest tenth of a per cent” makes clear to the pupil how far to carry his 
calculations. This will provide assurance that he reaches the degree of pre- 
cision desired and by the same token it will prevent him from wasting valuable 
testing time attempting to achieve a degree of precision that is not expected. 

There are some instances, especially in the area of science, when knowing 
the proper unit in which the answer is to be expressed and knowing the degree 
of precision to be expected are important aspects of the learning outcome to 
be measured. In such cases, the above suggestions must, of course, be modified. 

5. Blanks for answers should be equal in length and in a column to the right
of the question. If blanks for answers are kept equal in length, the length of
the blank space does not supply a clue to the answer. In the poor version of
the following items, note how the length of the blank restricts the possible
answers a pupil need consider. For the first item he needs a long word and for
the second item a short one.


EXAMPLES

Poor: What is the name of the part of speech that connects words, clauses, and sentences?
(conjunction)
What is the name of the part of speech that declares, asserts, or predicts
something? (verb)
Better: What is the name of the part of speech that
connects words, clauses, and sentences?                  (conjunction)
What is the name of the part of speech that
declares, asserts, or predicts something?                (verb)


Placing the blanks in a column to the right of the question, as shown in the 
improved version, makes scoring quicker and more accurate. 

6. Where completion items are used do not use too many blanks. If a state- 
ment is overmutilated the meaning will be lost and the pupil usually must resort 
to guessing what the teacher had in mind. Although some mutilated statements 
seem to measure rather complex reasoning abilities, such responses are more 
appropriate as measures of intelligence than achievement. 


EXAMPLES

Poor:   (Warm blooded) animals that are born (alive) and (suckle) their young
        are called (mammals).

Better: Warm blooded animals that are born alive and suckle their young are
        called (mammals).

In the revised version of the above item, note also that the blank is at the
end of the statement. This is a desirable practice, since the pupil is presented
with a clearly defined problem before he comes to the blank.


Objective Test Items: Simple Forms 127 


TRUE-FALSE OR ALTERNATIVE-RESPONSE ITEMS


The alternative-response item consists of a declarative statement which the
pupil is asked to mark true or false, right or wrong, correct or incorrect, yes
or no, agree or disagree, and the like. In each case there are only two possible
answers. Since the true-false option is the most common, this item type is
frequently referred to as simply the true-false test item. Some of the
variations, however, deviate considerably from the simple true-false pattern
and have their own distinct characteristics. For this reason, the more general
category, alternative-response item, is preferred.


Uses of Alternative-Response Items


Probably the most common use of the alternative-response item is in measuring
the ability to identify the correctness of statements of fact, definitions of
terms, statements of principles, and the like. For measuring such relatively
simple learning outcomes, a single declarative statement is used with any one
of several methods of responding.


EXAMPLES

Directions: Read each of the following statements. If the statement is true,
            circle the “T.” If the statement is false, circle the “F.”

(T) F   1. The green coloring material in a plant leaf is called chlorophyll.
T (F)   2. The corolla of a flower includes petals and sepals.
(T) F   3. Photosynthesis is the process by which leaves make the food for a plant.

Directions: Read each of the following questions. If the answer is yes, circle
            the “Y.” If the answer is no, circle the “N.”

(Y) N   1. Is 25% of 44 less than 12?
Y (N)   2. Is [illegible] of 38 more than 19?
Y  N    3. Is [illegible] of 4/10 equal to 2/5?
(Y) N   4. If [illegible] of a number is 9, is the number smaller than 9?
One of the most useful functions of the alternative-response item is in
measuring the pupil's ability to distinguish fact from opinion. The following
examples illustrate this use.


EXAMPLES

Directions: Read each of the following statements. If the statement is a fact,
            circle the “F.” If the statement is an opinion, circle the “O.”

(F) O   1. The Constitution of the United States is the highest law of our country.
F (O)   2. The first amendment to the Constitution is the most important amendment.
(F) O   3. The fifth amendment to the Constitution protects an individual from
           testifying against himself.
F (O)   4. Other countries should adopt a constitution like that of the United States.

Directions: Read each of the following statements. If the statement is true,
            circle the “T.” If the statement is false, circle the “F.” If the
            statement is an opinion, circle the “O.”

T F O   1. The earth is a planet.
T F O   2. The earth revolves around the moon.
T F O   3. There are no plants or animals on Mars.




The above items measure a learning outcome which is of importance in all sub- 
ject matter areas. If a person is to think critically about a topic, he must first 
be able to distinguish fact from opinion. 

All too frequently true-false tests include numerous opinion statements to
which the pupil is asked to respond merely true or false. This is extremely
frustrating, since there is no objective basis for determining whether a
statement of opinion is true or false. The pupil must usually guess what opinion
the teacher holds and mark his answers accordingly. This, of course, is
undesirable from all standpoints: testing, teaching, and learning. It is much
better to have the pupil identify the statements of opinion as such. An
alternative procedure is to attribute the opinion to some source. This makes it
possible to mark the statements true or false and provides a good measure of
knowledge concerning the beliefs held by an individual, or the values supported
by an organization or institution.


EXAMPLES

Directions: Read each of the following statements. If the statement is true,
            circle the “T.” If the statement is false, circle the “F.”

T (F)   1. Franklin D. Roosevelt believed that labor unions interfered with the
           free enterprise system in the United States.
(T) F   2. The American Federation of Labor favors the closed shop.
T (F)   3. The Supreme Court of the United States would support the principle of
           equal but separate facilities for the education of different racial
           groups.

Items such as those above can become measures of aspects of understanding,
if the opinion statements attributed to an individual, or group, are new to the
pupil. The task then becomes one of interpreting the beliefs held by the
individual, or group, and applying them to the new situation.

Another aspect of understanding that can be measured by the simple
alternative-response item is the ability to recognize cause and effect
relationships. This type of item usually contains two true propositions in one
statement and the pupil is to judge whether the relationship between them is
true or false.


EXAMPLES

Directions: In each of the following statements, both parts of the statement
            are true. You are to decide if the second part explains why the
            first part is true. If it does, circle the “Yes.” If it does not,
            circle the “No.”

Yes (No)   1. Leaves are essential          BECAUSE  they shade the trunk of the tree.
              parts of a tree
Yes (No)   2. Whales are mammals            BECAUSE  they are large.
(Yes) No   3. Some plants do not            BECAUSE  they get their food from other plants.
              need sunlight


The alternative-response item can also be used to measure some simple aspects
of logic, as illustrated by the following items.




EXAMPLES

Directions: Read each of the following statements. If the statement is true,
            circle the “T”; if it is false circle the “F.” Also, if the converse
            of the statement is true, circle the “CT”; if the converse is false,
            circle the “CF.” Be sure to give two answers for each statement.

T  F  CT  CF   1. All trees are plants.
T  F  CT  CF   2. All parasites are animals.
T  F  CT  CF   3. All eight-legged animals are spiders.
T  F  CT  CF   4. No spiders are insects.


A criticism of the simple alternative-response type item is that a pupil
may be able to recognize a false statement as incorrect but still not know what
is correct. For example, when a pupil answers the following item false, it does
not indicate that he knows what particles of negative electricity are called.
All it tells us is that he knows they are not called neutrons.

T (F)   Particles of negative electricity are called neutrons.

This is a rather crude measure of knowledge, since there is an inestimable
number of things that particles of negative electricity are not called. To
overcome such difficulties, some teachers prefer to have the pupils change all
false statements to true. When this is done, the part of the statement it is
permissible to change should be indicated.


EXAMPLES

Directions: Read each of the following statements. If a statement is true,
            circle the “T.” If a statement is false, circle the “F” and change
            the underlined word to make the statement true. Place the new word
            in the blank space after the “F.”

T (F)  (electrons)   1. Particles of negative electricity are called _neutrons_.
(T) F  ___________   2. Mechanical energy is changed into electrical energy by
                        means of the _generator_.
T (F)  (store)       3. An electric condenser is used to _generate_ electricity.


Unless the key words to be changed are indicated in the correction-type
true-false item, pupils are apt to rewrite the entire statement. In addition to
the increase in scoring difficulty, this frequently leads to true statements
which deviate considerably from the original intent of the item.
Advantages and Limitations of Alternative-Response Items


The advantages attributed to alternative-response items are not, unfortunately,
very valid. One advantage cited most frequently is ease of construction. This
belief has probably resulted from the all-too-common practice of taking
statements from a textbook, changing half of them to false statements, and
submitting the list to pupils as a true-false test. Such test items are
frequently so obvious that everyone gets them correct or so ambiguous that even
the better pupils are confused by them. In short, it is easy to construct poor
alternative-response items. To construct unambiguous alternative-response
items, which measure significant learning outcomes, however, requires an
extremely high degree of skill.

A second advantage attributed to the alternative-response item, which is also 
more apparent than real, is that a wide sampling of course material can be 
obtained. Since a pupil can respond to many test items in a short period of
time, it seems obvious that a large number of areas can be covered. Less obvious, 
however, is the fact that many types of subject matter do not lend themselves to 
alternative-response type items. True-false statements require course material 
that can be phrased in such a manner that the statements are true or false 
without qualification or exception. In all subject matter fields there are areas 
in which such absolutely true or false statements cannot be made. In some 
fields, such as the social sciences, practically all significant statements require 
some qualification. Only the most trivial statements can be reduced to absolute 
terms. 

One of the most serious limitations of the simple alternative-response item is 

in the types of learning outcomes that can be measured. As with the short- 
answer item, it is limited to the more elementary learning outcomes in the 
knowledge area. The main exceptions to this seem to be in distinguishing 
between fact and opinion, and in identifying cause-and-effect relationships. 
These two outcomes are probably the most important measured by this type 
of item. Most of the knowledge outcomes measured by the alternative-response 
item can be measured more effectively by other forms of selection items, espe- 
cially the multiple-choice form. 
Another factor which limits the usefulness of the alternative-response item
is its susceptibility to guessing. With only two alternatives, a pupil has a 50-50 
opportunity of selecting the correct answer on the basis of chance alone. Due 
to the difficulty of constructing items which do not contain clues to the answer, 
the pupil’s chances of guessing correctly are usually much greater than 50 per 
cent. With a typical 100 item true-false test, it is not unusual to have the lowest 
score somewhere above 80. Although an indeterminate amount of knowledge 
is reflected in such a score, many of the correct answers, beyond chance, can be 
accounted for by correct guesses guided by various clues that have been over- 
looked in constructing the items. A scoring formula utilizing a correction for 
guessing is frequently suggested as a solution for this problem. This formula 
takes into account only chance guesses, however, and does not include those 
guided by clues. In addition, such a scoring formula favors the aggressive indi- 
vidual willing to take a chance. When warned that there will be a penalty for 
guessing, he will continue to guess, using any clues available, and will do 
better than chance. The cautious student, on the other hand, will mark only 
those answers he is certain are correct and will omit many of the items he could 
mark correctly on the basis of clues and partial information. Thus, the scores 
tend to reflect personality differences as well as knowledge of the subject. 
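The text does not state the scoring formula it refers to. A minimal sketch,
assuming the conventional correction for guessing (rights minus wrongs divided
by the number of alternatives less one, which reduces to rights minus wrongs
for a two-choice item), is given below. The function name and the three pupil
profiles are hypothetical and chosen only to illustrate the argument above.

```python
def corrected_score(right, wrong, choices=2):
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items are neither rewarded nor penalized.
    """
    return right - wrong / (choices - 1)


# Three hypothetical pupils on a 100-item true-false test. Each actually
# knows the answers to 60 items.

# Guesses the other 40 by chance alone: about 20 right, 20 wrong.
chance_guesser = corrected_score(right=80, wrong=20)

# Guesses the other 40 with the help of clues: 30 right, 10 wrong.
clue_guesser = corrected_score(right=90, wrong=10)

# Cautious pupil: omits the 40 items he is unsure of.
cautious = corrected_score(right=60, wrong=0)
```

Under these assumed figures the formula returns the pure-chance guesser to 60,
the same score as the cautious pupil, but the pupil who guesses with the aid of
clues still scores 80. This is the sense in which the correction takes account
of chance guesses only and favors the pupil who is willing to guess.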

The high likelihood of successful guessing on the alternative-response item
has a number of deleterious effects. (1) The reliability of each item is low,
making it necessary to include a large number of items to obtain a reliable
measure of achievement. (2) The diagnostic value of such a test is practically
nil, since analyzing a pupil's response to each item is meaningless. (3) The
validity of pupils' responses is questionable because of response sets. As noted
earlier, a response set is a consistent tendency to follow a certain pattern in
responding to test items. In taking a true-false test, for example, some pupils
will consistently mark “true” those items they don't know while others will
consistently mark them “false.” Thus, any given test will favor one response
set over another and introduce an element into the test score which is
irrelevant to the purpose of the test.

The limitations of the simple alternative-response item are so serious that it
seems wise to use this item type only where other items are inappropriate for
measuring the desired learning outcomes. This would include situations where
there are only two possible alternatives (e.g., right, left; more, less; “who,”
“whom”; and so on) and special uses such as distinguishing fact from opinion,
cause from effect, superstition from scientific belief, relevant from
nonrelevant information, valid from invalid conclusions, and the like.


Suggestions for Constructing Alternative-Response Items


The main task in constructing alternative-response items, such as the
true-false type, is that of formulating statements which are free from ambiguity
and irrelevant clues. This is an extremely difficult task and the only guidance
that can be given is of a negative sort; that is, a list of things to avoid when
phrasing the statements.

1. Avoid broad general statements, if they are to be judged true or false.
Most broad generalizations are false unless qualified and the use of qualifiers
provides clues to the answer.

EXAMPLES

Poor:   T (F)   The President of the United States is elected to that office.
Poor:   (T) F   The President of the United States is usually elected to that office.


In the above example, the first version is generally true but must be marked
false because there are exceptions, such as the Vice-President taking office in
case of the President's death. In the second version, the qualifier “usually”
makes the statement true but provides a definite clue called a specific
determiner. These are words such as “usually,” “generally,” “often,” and
“sometimes,” which are most likely to appear in true statements, and absolute
terms such as “always,” “never,” “all,” “none,” and “only,” which are more apt
to appear in false statements. Although the influence of such clues can be
offset by balancing their use in true and false statements, the simplest
solution seems to be to avoid the use of generalizations which are obviously
false or must be qualified by specific determiners.

2. Avoid trivial statements. In an attempt to obtain statements which are
unequivocally true or false, we sometimes inadvertently turn to specific
statements of fact which fit this criterion beautifully but which have little
significance from a learning standpoint.




EXAMPLES

Poor:   (T) F   Harry S. Truman was the 33rd President of the United States.
Poor:   T (F)   The United States declared war on Japan, December 7, 1941.


The first item calls for a relatively unimportant fact concerning Truman's 
tenure as President. The second item expects the student to remember that the 
United States did not declare war until December 8. Such items cause students 
to direct their attention toward the memorization of minutiae at the expense 
of more general knowledge and understanding.

3. Avoid the use of negative statements, and especially double negatives. 
Negative words, such as “no” or “not,” tend to be overlooked by pupils and 
double negatives contribute to the ambiguity of the statement. Note the am- 
biguity in the relatively simple statement, using two negatives, below. 


EXAMPLES 


Poor:   (T) F   None of the steps in the experiment was unnecessary.
Better: (T) F   All of the steps in the experiment were necessary.


Where it is imperative that a negative word be used, it should be underlined 
or put in italics so that pupils do not overlook it. 


4. Avoid long, complex sentences. As noted earlier, a test item should indicate
whether a pupil has achieved the knowledge or understanding being measured.
Long, complex sentences tend to also measure the extraneous factor of reading
comprehension and therefore should be avoided in tests to measure achievement.


EXAMPLES

Poor:   (T) F   Despite the theoretical and experimental difficulties of
                determining the exact pH value of a solution, it is possible to
                determine if a solution is acid by the red color formed on
                litmus paper when it is inserted into the solution.
Better: (T) F   Litmus paper turns red in an acid solution.

As in the above example, it is frequently possible to shorten and simplify a
statement by eliminating nonfunctional material and restating the main idea.
Where this is not possible, it might be necessary to change to another item form
in order to avoid complex sentence structure.


5. Avoid including two ideas in one statement, unless cause-effect relationships
are being measured. Some of the difficulties arising from the inclusion of two
ideas in one statement are apparent in the following example. This is one of
many items of similar type a teacher actually used in a biology examination.
In each instance, he asked the pupils to merely judge whether the statement
was true or false.


EXAMPLE

Poor:   T (F)   A worm cannot see because it has simple eyes.

This item is keyed “false” because a worm does not have simple eyes. However,
when this teacher asked one of his slow learners why he marked it false, the
pupil said, “Worms can too see.” This of course highlights the fact that pupils
can get such items correct with misinformation of the most erroneous sort. This
is so because the first proposition can be true or false, the second
proposition can be true or false, and the relationship between them can be
true or false. Thus, when a pupil marks the item false, there is no way of
determining which of the three elements he is responding to. The best solution
seems to be to use only true propositions and to ask the pupils to judge the
truth or falsity of the relationship between them. Such items might, of course,
also be divided into two simple statements, each containing a single idea.
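The ambiguity of a “false” response to a two-idea item can be made concrete by
enumerating the possibilities. The sketch below is illustrative only; it
assumes that a statement of the form “A because B” is true only when the first
proposition, the second proposition, and the relationship between them are all
true.

```python
from itertools import product

ELEMENTS = ("first proposition", "second proposition", "relationship")

# Every combination of true/false for the three elements of the item.
states = list(product([True, False], repeat=len(ELEMENTS)))

# Combinations under which the whole statement would be judged false.
false_states = [state for state in states if not all(state)]

# For each such combination, name the element(s) the pupil may think false.
reasons_for_false = [
    [name for name, value in zip(ELEMENTS, state) if not value]
    for state in false_states
]
```

Of the eight combinations, seven lead to a “false” response, so a mark of
“false” by itself does not show which element the pupil judged to be wrong.
Restricting the item to two true propositions leaves the relationship as the
only element in question.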

6. If opinion is used, attribute it to some source, unless the ability to
identify opinion is being specifically measured. As pointed out earlier,
statements of opinion cannot be marked true or false and it is unfair to expect
pupils to guess how the teacher will score such items. It is, of course, also
poor teaching practice to expect pupils to respond to opinion statements as
statements of fact. Knowing whether some significant individual or group
supports or refutes a certain opinion, however, can have significance from a
learning standpoint.


EXAMPLES

Poor:   T  F    Adequate medical care can best be provided through socialized
                medicine.
Better: T (F)   The American Medical Association favors socialized medicine as
                the best means of providing adequate medical care.

The first version, in the above example, may serve a useful purpose in an
attitude test, but there is no factual basis on which to decide the truth or
falsity of the statement. The second version is clearly false.
7. True statements and false statements should be approximately equal in
length. There is a natural tendency for true statements to be longer because
such statements must be precisely phrased to meet the criterion of absolute
truth. This can be overcome by lengthening the false statements through the
use of qualifying phrases similar to those found in true statements. This will
eliminate length of statement as a possible clue to the correct answer.


8. The number of true statements and false statements should be approximately
equal. This will prevent response sets from unduly inflating or deflating
pupils' scores. You will recall that some pupils have a consistent tendency
to mark “true” when in doubt about an answer, while others have a consistent
tendency to mark “false.” Neither response set should be favored by overloading
the test with items of one type.

In following this suggestion the words “approximately equal” should be given
special attention. If a teacher consistently uses “exactly” the same number of
true statements and false statements, this may provide a clue to the pupil who
is unable to answer some of the test items. The best procedure seems to be to
vary the percentage of true statements somewhere between 40 and 60 per cent.
Under no circumstance should the statements be all true or all false. Pupils
who detect this as a possibility can obtain perfect scores on the basis of one
guess.
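The effect of a response set on scores, and the 40 to 60 per cent guideline,
can be sketched as follows. The function names and the sample answer keys are
hypothetical and serve only as an illustration; the pupil modeled here answers
correctly every item he knows and marks one fixed answer on every other item.

```python
def allowed_true_counts(n_items, low_pct=40, high_pct=60):
    """Numbers of true statements that fall within the suggested range."""
    return [
        t for t in range(n_items + 1)
        if low_pct * n_items <= 100 * t <= high_pct * n_items
    ]


def score_with_response_set(key, known, default):
    """Score for a pupil who marks `default` on every item not in `known`.

    key     -- list of booleans, the keyed answer to each item
    known   -- set of item indexes the pupil can answer correctly
    default -- the answer (True or False) marked when in doubt
    """
    return sum(
        1 for index, answer in enumerate(key)
        if index in known or answer == default
    )


# A 20-item test keyed all true rewards the "true" response set completely.
all_true_key = [True] * 20
score_all_true = score_with_response_set(all_true_key, known=set(), default=True)

# A test with 12 true and 8 false statements still favors it, but only mildly.
mixed_key = [True] * 12 + [False] * 8
score_marks_true = score_with_response_set(mixed_key, known=set(), default=True)
score_marks_false = score_with_response_set(mixed_key, known=set(), default=False)
```

For a 20-item test, `allowed_true_counts(20)` gives the counts from 8 through
12, so a teacher can vary the count from test to test without leaving the
suggested range.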




MATCHING EXERCISES 


In its traditional form, the matching exercise consists of two parallel columns 
with each word, number, or symbol in one column being matched to a word, 
sentence, or phrase in the other column. The items in the column for which a 
match is sought are called premises and the items in the column from which 
the selection is made are called responses. The basis for matching responses 
to premises is sometimes self-evident but more frequently must be explained in 
the directions. In any event, the pupil’s task is to identify the pairs of items that 
are to be associated on the basis indicated. For example, the pupil may be asked 
to identify the meaning of terms, as illustrated below. 


EXAMPLES

Directions: On the line to the left of each phrase in Column A, write the letter
            of the word in Column B that best matches the phrase. Each word in
            Column B may be used once, more than once, or not at all.

               Column A                                        Column B
(G)  1. Name of the answer in addition problems.           A  Difference
(A)  2. Name of the answer in subtraction problems.        B  Dividend
(D)  3. Name of the answer in multiplication problems.     C  Multiplicand
(E)  4. Name of the answer in division problems.           D  Product
                                                           E  Quotient
                                                           F  Subtrahend
                                                           G  Sum


This matching exercise illustrates an “imperfect match.” That is, there are
more terms in Column B than are needed to match each phrase in Column A.
The directions also indicate that an item may be used once, more than once, or
not at all. [...] prevent pupils from matching the final pair [...].

[...] First, the items in the list [...] all concerned with the names [...].
Homogeneity is necessary if a matching [...]. [...] premise in Column A there
are [...] the incorrect responses [...] doubt about the correct answer [...]
to minimize the opportunity for successful guessing.
Uses of Matching Exercises 


The matching exercise, as illustrated above, is limited to the measurement of
factual information based on simple associations. Wherever learning outcomes
emphasize the ability to identify the relationship between two things, and a
sufficient number of homogeneous premises and responses can be obtained, a
matching exercise seems most appropriate. It is a compact and efficient method
of measuring such simple knowledge outcomes. Examples of relationships
considered important by teachers, in a variety of fields, include the following:




Men                     Achievements
Dates                   Historical Events
Terms                   Definitions
Rules                   Examples
Symbols                 Concepts
Authors                 Titles of Books
Foreign Words           English Equivalents
Machines                Uses
Plants or Animals       Classification
Principles              Illustrations
Objects                 Names of Objects
Parts                   Functions


The matching exercise has also been used with pictorial materials in relating
pictures and words, and in identifying positions on maps, charts, and diagrams.
Regardless of the form of presentation, however, the pupil's task is essentially
that of relating two things which have some logical basis for association. This
restricts the use of the matching exercise to a relatively small area of pupil
achievement.


Advantages and Limitations of Matching Exercises


The major advantage of the matching exercise is its compact form, which
makes it possible to measure a large amount of related factual material in a
relatively short time. This is a mixed blessing, however, since it frequently
leads to excessive use of matching exercises and a corresponding overemphasis
on the memorization of simple relationships.

Another advantage frequently cited for the matching exercise is ease of
construction. As with the alternative-response item, poor items can be rapidly
constructed but good items require a high degree of skill. Much of the
difficulty arises because the correct response for each premise must also serve
as a plausible response for the other premises. Any lack of plausibility will
reduce the number of possible choices and provide clues to the correct answer.
The matching exercise tends to have more such irrelevant clues than any other
item type, with the possible exception of the true-false item.

The main limitations of the matching exercise have already been indicated.
It is restricted to the measurement of factual information based on rote
memorization, and it is highly susceptible to the presence of irrelevant clues.
Another factor, however, should also be mentioned. That is the difficulty of
obtaining homogeneous material which is significant from the viewpoint of our
objectives and learning outcomes. For example, we might start out with a few
great scientists and their achievements, which we feel all pupils should know.
In order to construct a matching item, it becomes necessary to add the names
and achievements of other, lesser known, scientists. Thus we find ourselves
measuring factual information which was not included in our original test plan
and which is far less important than other aspects of knowledge we had
intended to include. In short, less significant material is introduced into the
test because significant material of a homogeneous nature is unavailable. This
is a common problem in constructing matching exercises and one not easily
avoided. One solution is to begin with multiple-choice items, where each item
can be directly related to a particular learning outcome, and to switch to the
matching form only when homogeneous material makes the matching exercise a more
efficient method of measuring the same achievement.


Suggestions for Constructing Matching Items 


Although the matching exercise has limited usefulness in classroom tests, 
whenever it is used special efforts should be made to remove irrelevant clues 
and to arrange it in such a manner that the pupil can respond quickly and with- 
out confusion. The following suggestions are designed to guide such efforts. 

1. Use only homogeneous material in a single matching exercise. This has 
been mentioned before and is repeated here for emphasis. It is without a doubt 
the most important rule of construction and yet the one most commonly violated. 
One reason for this is that homogeneity is a matter of degree and what is 
homogeneous to one group may be heterogeneous to another. For example, let 
us assume that we are following the usual suggestion for obtaining homogeneity 
and develop a matching exercise which includes only men and their achieve- 
ments. We might end up with a test exercise similar to the one below. 


EXAMPLE

Directions: On the line to the left of each achievement listed in Column A,
            write the letter of the man’s name in Column B who is noted for that
            achievement. Each name in Column B may be used once, more than once,
            or not at all.

               Column A                                              Column B
(A)  1. Invented the telephone.                              A  Alexander Graham Bell
(B)  2. Discovered America.                                  B  Christopher Columbus
(C)  3. First United States astronaut to orbit the earth.    C  John Glenn
(F)  4. First President of the United States.                D  Abraham Lincoln
                                                             E  Ferdinand Magellan
                                                             F  George Washington
                                                             G  Eli Whitney


Although the matching exercise in our example may be homogeneous for 
most pupils in the primary grades, the discriminations called for are so gross 
that pupils above that level see it as a heterogeneous collection of inventors, 
explorers, and presidents. Thus, to obtain homogeneity at higher grade levels, 
it would be necessary to have only inventors and their inventions in one match-
ing exercise, explorers and their discoveries in another, and presidents and 
their achievements in another. At a still higher level, it might be necessary to 
limit matching exercises still further, such as inventors whose inventions are 
in the same specific area, in order to keep the material homogeneous and free 
from irrelevant clues. It should be noted that as we increase the level of
discrimination called for in a matching exercise, homogeneous material of a
significant nature becomes increasingly difficult to obtain. Take the inventors
for example. How many significant inventions are there in any one specific area?

2. Include an unequal number of responses and premises, and instruct the
pupil that responses may be used once, more than once, or not at all. This will
make all of the responses eligible for selection for each premise and will
decrease the likelihood of successful guessing. Where an equal number of
responses and premises are used, and each response is used only once, the
probability for guessing the remaining responses correctly is increased each
time a correct answer is selected. Odds for correct guessing increase as the
list of available responses decreases and the final response, of course, can be
selected entirely on the basis of this process of elimination.

With most matching exercises, imperfect matching can be obtained by including
a few more, or a few less, responses than premises. In either case, the
directions should instruct the pupil that each response may be used once, more
than once, or not at all.
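The process of elimination described in suggestion 2 can be put in numbers. The
sketch below is an illustration under simple assumptions: the pupil guesses
blindly among whatever responses remain eligible, and the function names are
hypothetical.

```python
from fractions import Fraction
from math import factorial


def chance_next_perfect(n_pairs, already_matched):
    """Perfect matching: each response is used once, so only the unused
    responses remain eligible for the next premise."""
    return Fraction(1, n_pairs - already_matched)


def chance_next_imperfect(n_responses):
    """Imperfect matching: every response stays eligible for every premise."""
    return Fraction(1, n_responses)


def chance_all_correct_perfect(n_pairs):
    """Blind guess of a whole perfect-matching exercise: one of n! orderings."""
    return Fraction(1, factorial(n_pairs))


def chance_all_correct_imperfect(n_premises, n_responses):
    """Blind guess when each premise may take any of the responses."""
    return Fraction(1, n_responses ** n_premises)


# Five premises and five responses, each response used once: the chance of
# guessing the next premise rises with every pair already matched.
rising_odds = [chance_next_perfect(5, k) for k in range(5)]
```

In the five-pair case the chances run 1/5, 1/4, 1/3, 1/2, and finally 1, which
is the “final response selected entirely by elimination.” With four premises
and seven reusable responses, as in the arithmetic example earlier in the
chapter, the chance on every premise stays at 1/7.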


3. Keep the list of items to be matched brief and place the shorter responses
on the right. A brief list of items is advantageous to both the teacher and the
pupil. From the teacher's standpoint, it is easier to maintain homogeneity in a
brief list. In addition, there is a greater likelihood that the various learning
outcomes and subject-matter topics will be [...]. Since each matching exercise
must be based on [...] list [...] concentration in one area. [...] brief list
enables him to read the responses rapidly and without confusion. Approximately
four to seven items in each column seems preferable. There certainly should be
no more than ten in either column.

Placing the shorter items on the right, as responses, also contributes to more
efficient test taking. This enables the pupil to read the longer premise first
and then scan rapidly the list of responses.

4. Arrange the list of responses in logical order. Place words in alphabetical
order and numbers in sequence. This will contribute to the ease with which the
pupil can scan the responses in searching for the correct answers. It will also
prevent the pupil from detecting possible clues due to the arrangement of the
responses.

EXAMPLE

Directions: On the line to the left of each historical event in Column A, write
the letter from Column B which identifies the time period during which the event
occurred. Each date in Column B may be used once, more than once, or not at all.

       Column A                                     Column B
(B)    1. Boston Tea Party                          A 1765-1769
(A)    2. Repeal of the Stamp Act                   B 1770-1774
(E)    3. Enactment of Northwest Ordinance          C 1775-1779
(C)    4. Battle of Lexington                       D 1780-1784
(A)    5. Enactment of Townshend Acts               E 1785-1789
(B)    6. First Continental Congress
(E)    7. United States Constitution drawn up


This matching exercise also illustrates the use of fewer responses than
premises and the desirability of placing the shortest items on the right.

5. Indicate in the directions the basis for matching the responses and premises. 
Although the basis for matching is rather obvious in most matching exercises, 
there are advantages in clearly stating the intended basis. For one thing,
ambiguity and confusion will be avoided. For another, testing time will be saved
since the pupil will not need to read through the entire list of premises and
responses and then “reason out” the basis for matching.

Special care must be taken when stating directions for matching items. Direc- 
tions which precisely indicate the basis for matching frequently become long 
and involved, placing a premium on reading comprehension. For younger 
pupils, it may be desirable to give oral directions, put an example on the black- 
board, and have pupils draw lines between the matched items rather than
transfer letters.

6. Place all of the items for one matching exercise on the same page. This 
will prevent the disturbance created by thirty, or so, pupils switching the pages 
of the test back and forth. It also prevents pupils from overlooking the responses
appearing on another page, and generally adds to the speed and efficiency of
test administration.
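Several of the suggestions above (unequal columns, no more than ten items in a column, responses in logical order) lend themselves to a mechanical check of a draft exercise. The sketch below is offered only as an illustration in Python. The function name and the wording of the warnings are hypothetical, and the ordering check assumes the responses are all words or all numbers so that they can be sorted directly.

```python
def check_matching_exercise(premises, responses):
    """Return a list of warnings for a draft matching exercise."""
    warnings = []
    # Suggestion 2: an unequal number of premises and responses.
    if len(premises) == len(responses):
        warnings.append("equal number of premises and responses")
    # Suggestion 3: no more than ten items in either column.
    if max(len(premises), len(responses)) > 10:
        warnings.append("more than ten items in a column")
    # Suggestion 4: responses in alphabetical or numerical order.
    if list(responses) != sorted(responses):
        warnings.append("responses are not in alphabetical or numerical order")
    return warnings


premises = [
    "Boston Tea Party",
    "Repeal of the Stamp Act",
    "Enactment of Northwest Ordinance",
    "Battle of Lexington",
    "Enactment of Townshend Acts",
    "First Continental Congress",
    "United States Constitution drawn up",
]
responses = ["1765-1769", "1770-1774", "1775-1779", "1780-1784", "1785-1789"]

print(check_matching_exercise(premises, responses))
```

A check of this kind cannot judge homogeneity or the clarity of the directions. Those remain matters for the teacher's judgment.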


SUMMARY 


The construction of classroom tests, like other phases of teaching, is an art
which must be learned. It is not automatically derived from a knowledge of
subject matter, a formulation of the learning outcomes to be achieved, or a
psychological understanding of the mental processes of pupils, although all of
these are basic prerequisites. The ability to construct high quality test items
requires, in addition, a knowledge of the principles and techniques of test
construction and skill in their application.

In this chapter techniques for constructing short-answer items, true-false or
alternative-response items, and matching exercises have been considered. These
simple forms of objective test items are restricted, almost entirely, to the
measurement of knowledge outcomes. They are generally unsuitable for measuring
understandings, thinking skills, and other complex types of achievement.

The short-answer item requires pupils to supply the appropriate word, number,
or symbol to a direct question or incomplete statement. It can be used for
measuring a variety of simple knowledge outcomes but it is especially useful
for measuring problem-solving ability in science and mathematics. The ease
with which short-answer items can be constructed and their relative freedom
from guessing favor their use. However, the areas in which they can be
effectively used are restricted by the relatively simple learning outcomes measured
and by the fact that the scoring is contaminated by spelling errors of varying
degrees of magnitude. Where short-answer items are used, the question must
be stated clearly and concisely, be free from irrelevant clues, and require an
answer which is both brief and definite. Problems requiring only a number
or symbol for an answer are particularly adaptable to the short-answer form.

The alternative-response item requires the pupil to select one of two possible
answers. The most common form is the true-false item but there are numerous
variations. This item type is used for measuring simple knowledge outcomes
where only two alternatives are possible, or where the ability to identify the
correctness of statements of fact is important. It is also adaptable to measuring
the ability to distinguish fact from opinion and the ability to recognize
cause-and-effect relationships. The difficulty of constructing items free from clues,
which measure significant learning outcomes; the susceptibility of this type to
guessing; the low reliability of each item; and the general lack of diagnostic
value severely limit its use. It might well be restricted to those areas where
other item types are inappropriate. When used, special efforts must be made to
formulate statements which are free from ambiguity, specific determiners, and
clues of various types.
The matching exercise consists of two parallel columns of phrases, words,
or symbols which must be matched on some basis. Examples of items
included in matching exercises are men and achievements, dates and historical
events, terms and definitions, and the like. The nature of the matching exercise
limits it to measuring the ability to identify the relationship between two things.
For this restricted use, it is a compact item type which can be used to measure
a large number of relationships in a short time. Its limitations include the
difficulty of removing irrelevant clues and the difficulty of finding homogeneous
material of a significant nature. Where homogeneous material is available,
including more items in one column than the other, arranging the shorter responses
on the right and in logical order, and indicating clearly the basis for matching
will all contribute to the effectiveness of the matching exercise.




Chapter 8
Constructing Objective Test Items: Multiple-Choice Form


Objective test items are not limited to the measurement of simple learn- 
ing outcomes. . . . The multiple-choice item can measure at both the 
knowledge and understanding levels. . . . It is also free of many of the 
limitations of other forms of objective items. 


The multiple-choice item is generally recognized as the most widely applicable 
and useful type of objective test item. It can more effectively measure many of 
the simple learning outcomes measured by the short-answer item, the alter- 
native-response item, and the matching exercise. In addition, it can measure a 
variety of the more complex outcomes in the knowledge and understanding 
areas. This flexibility, plus the higher quality of the items usually attained with 
the multiple-choice form, has led to its extensive use in achievement testing. 


CHARACTERISTICS OF MULTIPLE-CHOICE ITEMS 


A multiple-choice item consists of a problem and a list of suggested solutions. 
The problem may be stated in the form of a direct question or an incomplete 
statement and is called the stem of the item. The list of suggested solutions may 
include words, numbers, symbols, or phrases and are called alternatives. The 
pupil is typically requested to read the stem and the list of alternatives and to 
select the one correct, or best, alternative. The correct alternative in each item 
is called merely the answer, while the remaining alternatives are called
distracters. These incorrect alternatives receive their name from their intended
function, that is, to distract those pupils who are in doubt about the correct
answer.
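For readers who keep their test items in a computer file, the parts of a multiple-choice item named above map naturally onto a small record. The following is a minimal sketch in Python, not a prescribed format. The class name `MultipleChoiceItem` and its field names are hypothetical, and the sample item is the state-capital example used in this chapter.

```python
from dataclasses import dataclass


@dataclass
class MultipleChoiceItem:
    stem: str                 # the problem: a direct question or incomplete statement
    alternatives: list[str]   # the list of suggested solutions
    answer_index: int         # position of the correct, or best, alternative

    @property
    def answer(self) -> str:
        """The one correct, or best, alternative."""
        return self.alternatives[self.answer_index]

    @property
    def distracters(self) -> list[str]:
        """The remaining, incorrect alternatives."""
        return [a for i, a in enumerate(self.alternatives) if i != self.answer_index]


item = MultipleChoiceItem(
    stem="In which one of the following cities is the capital of California located?",
    alternatives=["Los Angeles", "Sacramento", "San Diego", "San Francisco"],
    answer_index=1,
)
print(item.answer)
print(item.distracters)
```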




Whether to use a direct question or incomplete statement in the stem depends
on several factors. The direct question form is easier to write, is more natural
to younger pupils, and is more apt to present a clearly formulated problem.
On the other hand, the incomplete statement is more concise and if skillfully
phrased, it too can present a well-defined problem. A common procedure is to
start each stem as a direct question, shifting to the incomplete statement form
only when the clarity of the problem can be retained and greater conciseness
achieved.


EXAMPLES

Direct-question form:
In which one of the following cities is the capital of California located?
A Los Angeles
(B) Sacramento
C San Diego
D San Francisco

Incomplete-statement form:
The capital of California is located in
A Los Angeles.
(B) Sacramento.
C San Diego.
D San Francisco.


In the above examples, there is one absolutely correct answer. The capital of
California is located in Sacramento and nowhere else. All other alternatives
are wrong. For obvious reasons, this is known as the correct-answer type of
multiple-choice item.

Not all knowledge can be stated in such precise terms that there is only one
absolutely correct response. In fact, when we get beyond the simple aspects of
knowledge, represented by questions of the who, what, when, and where variety,
answers of varying degrees of acceptability are the rule rather than the
exception. Questions of the why variety, for example, tend to reveal a number of
possible reasons, some of which are clearly better than the others. Likewise,
questions of the how variety usually reveal several possible procedures, some
of which are clearly more desirable than the others. Measures of achievement in
these areas, then, become a matter of selecting the best answer from a series of
alternatives of varying degrees of correctness. This best-answer type of
multiple-choice item is illustrated below.
EXAMPLES

Best-answer type:
Which one of the following factors contributed most to the selection of Sacramento as the
capital of California?
(A) Central location
B Good climate
C Good highways
D Large population

(or)

Which one of the following factors is given most consideration when selecting a city for
a state capital?
(A) Location
B Climate
C Highways
D Population

What is the most important purpose of city zoning laws?
A Attract industry
B Encourage the building of apartments
(C) Protect property values
D Provide school “safety zones”


The best-answer type of multiple-choice item tends to be more difficult than
the correct-answer type. This is due partly to the finer discriminations called
for and partly to the fact that such items are used to measure learning outcomes
of a more complex nature.


USES OF MULTIPLE-CHOICE ITEMS 


The multiple-choice item is the most versatile type of test item available. It
can measure a variety of learning outcomes from the simple to the complex
and it is adaptable to most types of subject-matter content.¹ It has such wide
applicability and so many specific uses that many standardized tests use
multiple-choice items exclusively.² It is obvious that all of the specific uses of the
multiple-choice item cannot be illustrated. We shall confine ourselves, here, to its use
in measuring some of the more typical learning outcomes in the knowledge and
understanding areas. The measurement of more complex outcomes, using
modified forms of the multiple-choice item, will be considered in the following
chapter.


Measuring Knowledge Outcomes 


Learning outcomes in the knowledge area are so prominent in all school
subjects and multiple-choice items can measure such a variety of these outcomes
that illustrative examples are endless. Here, we shall present some of the more
typical uses of the multiple-choice form in measuring knowledge outcomes
common to most school subjects.


Knowledge of Terminology. A simple but basic learning outcome measured
by the multiple-choice item is that of knowledge of terminology. For this
purpose, the pupil can be requested to show his knowledge of a particular term
by selecting a word which has the same meaning as the given term or by
selecting the definition of the term. Special uses of a term can also be measured, by
having the pupil identify the meaning of the term when used in context.

¹ E. J. Furst, Constructing Evaluation Instruments (New York: Longmans, Green, 1958).
R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and Education
(New York: John Wiley & Sons, 1961). R. M. W. Travers, How to Make Achievement Tests
(New York: Odyssey Press, 1950).

² This practice is not recommended for classroom testing. Despite the wide applicability
of the multiple-choice item, there are learning outcomes, such as the ability to organize and
present ideas, that cannot be measured with any form of selection item.


EXAMPLES

Which one of the following words has the same meaning as the word egress?
A Depress
B Enter
(C) Exit
D Regress

Which one of the following statements best defines the word egress?
A An expression of disapproval
(B) An act of leaving an enclosed place
C Proceeding to a higher level
D Proceeding to a lower level

What is meant by the word egress in the following sentence: “The astronaut hopes he can
make a safe egress”?
A Separation from the rocket
B Reentry into the earth’s atmosphere
C Landing on the water
(D) Escape from the space capsule


Knowledge of Specific Facts. Another learning outcome basic to all school
subjects is the knowledge of specific facts. It is important in its own right and
it provides a necessary basis for developing understandings, thinking skills,
and other complex learning outcomes. Multiple-choice items designed to measure
specific facts can take many different forms but questions of the who, what,
when, and where variety are typical.


EXAMPLES

Who was the first United States astronaut to orbit the earth in space?
A Scott Carpenter
(B) John Glenn
C Virgil Grissom
D Alan Shepard

What was the name of the missile which launched the first United States astronaut into
orbital flight around the earth?
(A) Atlas
B Mars
C Midas

Where did the astronaut land after the first United States orbital flight?
(A) Atlantic Ocean
B Caribbean Sea
C Gulf of Mexico
D Pacific Ocean


Knowledge of Principles. This is also an important learning outcome in
most school subjects. Multiple-choice items can be constructed to measure
knowledge of principles as easily as those designed to measure knowledge of
specific facts. The items appear a bit more difficult but this is because principles
are more complex than isolated facts.


EXAMPLES

According to the principle of capillary action, fluids
A enter solutions of lower concentration.
B escape through small openings.
C pass through semipermeable membranes.
(D) rise in fine tubes.

Which one of the following principles of taxation is characteristic of the federal income tax?
A The benefits received by an individual should determine the amount of his tax.
(B) A tax should be based on an individual's ability to pay.
C All citizens should be required to pay the same amount of tax.
D The amount of tax an individual pays should be determined by the size of the
federal budget.


Knowledge of Methods and Procedures. Another common learning outcome
readily adaptable to the multiple-choice form is knowledge of methods
and procedures. This includes such diverse areas as knowledge of laboratory
procedures; knowledge of methods underlying communication, computational,
and performance skills; knowledge of methods used in problem solving;
knowledge of governmental procedures; and knowledge of common social practices.
The following test items illustrate a few of these uses in different school subjects.


EXAMPLES

Which one of the following methods of locating a specimen under the microscope is most
desirable?
A Start with coarse adjustment up and with eye at eyepiece turn down coarse
adjustment.
(B) Start with coarse adjustment down and with eye at eyepiece turn up coarse
adjustment.
C Start with coarse adjustment in center and with eye at eyepiece turn up and down
until specimen is located.

To make treaties, the President of the United States must have the consent of the
A Cabinet.
B House of Representatives.
(C) Senate.
D Supreme Court.

Alternating electric current is changed to direct current by means of a
A condenser.
B generator.
(C) rectifier.
D transformer.

If you were making a scientific study of a problem, your first step should be the
(A) collection of information about the problem.
B development of hypotheses to be tested.
C design of the experiment to be conducted.
D selection of scientific equipment.

We have merely scratched the surface with our illustrative uses of
multiple-choice items in the measurement of knowledge outcomes. As you develop items
in the particular school subjects you teach, many other uses will occur to you.


Measuring Outcomes at the Understanding Level

The majority of multiple-choice test items constructed by classroom teachers
are limited to the knowledge area. There are a number of reasons for this
including the ease with which such items can be constructed and the fact that
much teaching is concerned with simple knowledge outcomes. Many teachers
limit their items to the knowledge area, however, because they believe that
all objective-type items are restricted to the measurement of
relatively simple learning outcomes. While this is true of most of the other
types of objective items, the multiple-choice item is especially adaptable to the
measurement of more complex learning outcomes. The examples below illustrate
its use in the measurement of various aspects of understanding.

In reviewing the following illustrative items, it is important to keep in mind
that such items measure learning outcomes beyond that of factual knowledge
only if the applications and interpretations are new to the pupils. Any specific
applications or interpretations of knowledge can, of course, be taught directly
to the pupils as any other specific fact is taught. Where this is done and the test
items contain the same problem situations and solutions used in teaching, it is
obvious that the pupils can be given credit for no more than the mere retention
of factual knowledge. To measure understandings an element of novelty must
be included in the test items. For illustrative purposes, it is necessary to assume
that novelty exists in the examples that follow.

Ability to Apply Facts and Principles. A common method of determining
if pupils understand a fact or principle is to ask them to identify its correct
application in a situation which is new to the pupil.


EXAMPLES

Which one of the following is an example of a chemical element?
A Acid
B Sodium Chloride
D Water

Directions: In each of the following sentences circle the word that makes the sentence
correct.
1. This is the boy (who, whom) asked the question.
2. This is the dog (which, whom) he asked about.

Which one of the following best illustrates the principle of capillarity?
(A) Fluid is carried through the stems of plants.
B Food is manufactured in the leaves of plants.
C The leaves of deciduous plants lose their green color in winter.
D Plants give off moisture through their stomata.

Pascal’s law can be used to explain the operation of
A electric fans.
(B) hydraulic brakes.
C levers.
D syringes.

Which one of the following best illustrates the law of diminishing returns?
A The demand for a farm product increased faster than the supply of the product.
B The population of a country increased faster than the means of subsistence.
C A machine decreased in utility as its parts became worn.
(D) A factory doubled its labor force and increased production 50 per cent.


Ability to Interpret Cause-and-Effect Relationships. Understanding may 
also be measured by asking pupils to interpret various relationships between 
facts. One of the most important relationships in this regard, and one common 
to most subject-matter areas, is the cause-and-effect relationship. Understanding 
of such relationships can be measured by presenting the pupil with a specific
cause-and-effect relationship and asking him to identify the reason which best
accounts for it.


EXAMPLES

Bread will not become moldy as rapidly if placed in a refrigerator because
(A) cooling retards the growth of fungi.
B darkness retards the growth of mold.
C cooling prevents the bread from drying out so rapidly.
D mold requires both heat and light for best growth.

There is an increased quantity of carbon monoxide produced when fuel is burned in a
limited supply of oxygen because
A carbon reacts with carbon monoxide.
(B) carbon reacts with carbon dioxide.
C carbon monoxide is an effective reducing agent.
D greater oxidation takes place.

Investing money in common stock provides protection against loss of assets during inflation
because common stock
A pays higher rates of interest during inflation.
B provides a steady but dependable income despite economic conditions.
C is protected by the Federal Reserve System.
(D) increases in value as the value of a business increases.


Ability to Justify Methods and Procedures. Another phase of understand- 
ing that is important in various subject-matter areas is that concerned with 
methods and procedures. A pupil might know the correct method or the correct 
sequence of steps in carrying out a procedure, without being able to explain 
why it is the best method or sequence of steps. At the understanding level we 
are interested in the pupil’s ability to justify the use of a particular method 
or procedure. This can be measured with multiple-choice items by presenting
the pupil with several possible explanations of a method or procedure and asking
him to select the best one.


EXAMPLES

Why is adequate lighting necessary in a balanced aquarium?
A Fish need light to see their food.
B Fish take in oxygen in the dark.
(C) Plants expel carbon dioxide in the dark.
D Plants grow too rapidly in the dark.

Why do farmers rotate their crops?
(A) To conserve the soil.
B To make marketing easier.
C To provide for strip cropping.
D To provide more uniform working conditions throughout the year.

Why is nickel used in the process of changing cottonseed oil to a solid fat?
A It improves the texture and firmness.
B It removes the nutlike odor.
C It removes the brownish-yellow color.
(D) It speeds up the process.

Although various aspects of understanding can be measured by single
multiple-choice items, as illustrated in the above examples, a series of multiple-choice
items based on a common set of data is even more adaptable to the measurement
of complex achievement. Such items will be illustrated in the following
chapter.


ADVANTAGES AND LIMITATIONS OF MULTIPLE-CHOICE ITEMS


The main advantage of the multiple-choice item has already been mentioned
and illustrated. It is one of the most widely applicable test items for measuring
achievement. While it can measure various types of knowledge effectively, it
can also measure a variety of complex learning outcomes. In addition to this
greater flexibility, it is free from some of the common shortcomings
characteristic of the other item types. The ambiguity and vagueness which frequently are
present in the short-answer item are avoided because the alternatives provide
greater structure to the situation. In the following examples, taken from a test
in health education, note how the vague short-answer item becomes a clear-cut
problem when put in multiple-choice form. The short-answer item could be
answered in many different ways but the multiple-choice item restricts the pupil's
response to a specific area.


EXAMPLES

Poor: Drinking alcohol generally results in increased ________.

Better: Drinking alcohol generally results in increased
A alertness.
B attention.
(C) confidence.
D self-consciousness.




The need for homogeneous material which includes a series of related ideas,
a factor causing the greatest difficulty in constructing matching items, is
likewise avoided with the multiple-choice item. Since each item measures a single
idea, it is possible to measure one or many relationships in any given area.
Use of the best-answer type multiple-choice item also circumvents one of the
main difficulties associated with the true-false item, that of obtaining statements
which are true or false without qualification. This makes it possible to measure
learning outcomes in the numerous subject-matter areas where solutions to
problems are not clearly true or false but vary in degree of appropriateness. Another
advantage of the multiple-choice item over the true-false item is the greater
reliability per item. Because the number of alternatives is increased from two
to four or five the opportunity for guessing the correct answer is reduced
proportionately.
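The reduction in guessing can be put in numbers. The short Python sketch below assumes pure blind guessing, with every alternative equally likely to be chosen. The function names and the twenty-item test length are hypothetical values used only for illustration.

```python
from fractions import Fraction


def chance_of_guessing(n_alternatives: int) -> Fraction:
    """Probability of a correct blind guess on one item."""
    return Fraction(1, n_alternatives)


def expected_chance_score(n_items: int, n_alternatives: int) -> Fraction:
    """Number of items a pupil could expect to get right by blind guessing alone."""
    return n_items * chance_of_guessing(n_alternatives)


# Assumed illustration: a 20-item test in each of three forms.
for label, k in (("true-false", 2), ("four-choice", 4), ("five-choice", 5)):
    print(label, chance_of_guessing(k), expected_chance_score(20, k))
```

Pupils who can rule out one or more distracters are not guessing blindly, so actual chance scores may well differ from these figures.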

Two other desirable characteristics of the multiple-choice item are worthy of
mention. First, it is practically free from response sets.⁴ That is, pupils do not
have a tendency to favor a particular alternative when they don’t know the
answer. Second, the use of a number of plausible alternatives makes the results
amenable to diagnosis. The nature of the incorrect alternatives selected by
pupils provides clues to factual errors and misunderstandings that need
correction.
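The diagnostic use of distracters amounts to a simple tally of the alternatives pupils choose. The following Python sketch shows one way such a tally might be made. The function name is hypothetical, and the ten pupil responses are invented solely for illustration.

```python
from collections import Counter


def tally_alternatives(responses, answer):
    """Return the number of pupils choosing the answer, and a count of
    how many pupils chose each incorrect alternative."""
    counts = Counter(responses)
    wrong = {alt: n for alt, n in counts.items() if alt != answer}
    return counts[answer], wrong


# Invented responses of ten pupils to one item keyed C.
responses = list("CCACCBCAAC")
n_correct, wrong = tally_alternatives(responses, "C")
print(n_correct, wrong)
```

A distracter chosen by several pupils, as alternative A is here, suggests a particular misunderstanding worth taking up in class.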

The wide applicability of the multiple-choice item, plus its unique advantages,
makes it easier to construct high quality test items in this form than in any of
the other objective forms. This does not mean that good multiple-choice items
can be constructed without effort. But for a given amount of effort,
multiple-choice items will tend to be of a higher quality than short-answer, true-false,
or matching-type items in the same area.⁵

Despite its superiority, the multiple-choice item does have limitations. First,
it shares certain limitations with all other paper-and-pencil tests. It is limited
to learning outcomes at the verbal level. The problems presented to pupils are
verbal problems, free from many of the irrelevant factors present in natural
situations. The alternative solutions to problems pupils are asked to consider are
verbal alternatives, free from the emotional concomitants of alternative
solutions in natural situations. The applications pupils are asked to make are
verbal applications, free from the personal commitment necessary for
application in natural situations. In short, the multiple-choice item, like other
paper-and-pencil tests, measures whether the pupil knows or understands what to do
when confronted with a problem situation, but it cannot determine how the
pupil will perform in an actual situation. Second, the multiple-choice item shares
a basic limitation with other types of selection items. Since it requires selection
of the correct answer, it is not well adapted to the measurement of some
problem-solving skills in mathematics and science, and it is inappropriate for measuring
the ability to organize and present ideas. Third, the multiple-choice item has a
limitation not common to other item types, that is, the difficulty of locating a
sufficient number of incorrect but plausible distracters. This difficulty diminishes
considerably, however, as experience is obtained in constructing such items.

³ R. L. Ebel, “Writing the Test Item,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).

⁴ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).

⁵ R. M. W. Travers, How to Make Achievement Tests (New York: Odyssey Press, 1950).


SUGGESTIONS FOR CONSTRUCTING MULTIPLE-CHOICE ITEMS 


The general applicability and the superior qualities of multiple-choice test
items are realized most fully, obviously, when care is taken in their
construction. This involves the formulation of a clearly stated problem, the identification
of plausible alternatives, and special efforts to remove irrelevant clues to the
answer. The following suggestions provide more specific maxims for this
purpose:⁶

1. The stem of the item should be meaningful by itself and should present a
definite problem. All too frequently the stems of test items placed in
multiple-choice form are incomplete statements which make little sense until all of the
alternatives have been read. These are not multiple-choice items but rather a
collection of true-false statements placed in multiple-choice form. A
well-constructed multiple-choice item presents a definite problem in the stem, which
is meaningful without the alternatives. Compare the stems in the two versions
of the test item in the following examples.


EXAMPLES

Poor: South America
A is a flat, arid country.
B imports coffee from the United States.
C has a larger population than the United States.
(D) was settled mainly by colonists from Spain.

Better: Most of South America was settled by colonists from
A England.
B France.
C Holland.
(D) Spain.
Formulating a definite problem in the stem not only improves the stem of the
item but it also has a desirable effect on the alternatives. In the above example,
note that the alternatives in the first version are concerned with widely dissimilar
ideas. This heterogeneity is possible because of the lack of structure provided by
the stem. In the second version, the clearly formulated problem in the stem
forces the alternatives to be more homogeneous.

A good check on the adequacy of the problem statement is to cover the
alternatives and read the stem by itself. It should be complete enough to serve
as a short-answer item. Starting each item stem as a direct question, and shifting
to the incomplete statement form only when greater conciseness is possible,
provides the most effective method for obtaining a clearly formulated problem.
PA ctive me hod for bt g y P 


2. The item stem should include as much of the item as possible and should
be free of irrelevant material. This will increase the probability of a clearly
stated problem in the stem and will reduce the reading time required. The
following examples illustrate how the conciseness of an item is increased by
removing irrelevant material and by including in the stem those words
repeated in the alternatives. It should also be noted that to obtain the conciseness
of the final version, it was necessary to shift to the incomplete-statement form.

⁶ R. M. W. Travers, How to Make Achievement Tests (New York: Odyssey Press, 1950).


EXAMPLES 


Poor: Most of South America was settled by colonists from Spain. How would you
account for the large number of Spanish colonists settling there?
     A They were adventurous.
     Ⓑ They were in search of wealth.
     C They wanted lower taxes.
     D They were seeking religious freedom.
Better: Why did Spanish colonists settle most of South America?
     A They were adventurous.
     Ⓑ They were in search of wealth.
     C They wanted lower taxes.
     D They were seeking religious freedom.
Best: Spanish colonists settled most of South America in search of
     A adventure.
     Ⓑ wealth.
     C lower taxes.
     D religious freedom.


There are a few notable exceptions to this rule. In testing problem-solving 
ability, irrelevant material might be included in the stem of an item to deter- 
mine if pupils are capable of identifying and selecting that material which is 


relevant to the solution of the problem. Similarly, repeating common words in
the alternatives is sometimes necessary for grammatical consistency or greater
clarity.


3. Use a negatively stated item stem only when significant learning outcomes 
require it. Most problems can and should be stated in positive terms. This avoids 
the possibility of pupils overlooking the no, not, least, and similar words used 
in negative statements. In most instances, it also avoids the measurement of rela- 
tively insignificant learning outcomes. Knowing the least important method, 
the principle which does not apply, or the poorest reason is seldom related to 
important learning outcomes. We are usually interested in pupils learning the 
most important method, the principle which does apply, and the best reason. 

Teachers sometimes go to ridiculous extremes to use negatively stated items
because they appear more difficult. The difficulty of such items, however, resides
in lack of sentence clarity rather than the greater difficulty of the concept being
measured.


EXAMPLES
Poor: Which one of the following states is not located north of the Mason-Dixon line?
     A Maine
     B New York
     C Pennsylvania
     Ⓓ Virginia
Better: Which one of the following states is located south of the Mason-Dixon line?
     A Maine
     B New York
     C Pennsylvania
     Ⓓ Virginia

Both versions of this item measure the same specific knowledge. However,
some pupils who can answer the second version correctly will select an incorrect
alternative on the first version merely because the negative phrasing confuses
them. Such items thereby introduce factors which contribute to the invalidity
of the test.

Although negatively stated items are generally to be avoided, there are
occasions where they are useful. These are mainly in areas where the wrong
information, or wrong procedure, can have dire consequences. In the health area,
for example, there are practices to be avoided because of their harmful nature.
In shop and laboratory work, there are procedures which can damage equipment
and result in bodily injury. In driver training there are a number of unsafe
practices to be emphasized. Where the avoidance of such potentially harmful
practices is emphasized in teaching, it might well receive a corresponding
emphasis in testing through the use of negatively stated items. When this is
done, the negative aspects of the item should be made obvious to the pupil.



EXAMPLES
Poor: Which one of the following is not a safe driving practice on icy roads?
     A Accelerating slowly
     Ⓑ Jamming on the brakes
     C Holding the wheel firmly
     D Slowing down gradually
Better: When driving on icy roads, it is _not_ safe to
     A accelerate slowly.
     Ⓑ jam on the brakes.
     C hold the wheel firmly.
     D slow down gradually.

In the first version of the item the “not” is easily overlooked, in which case
pupils would tend to select the first alternative and go on to the next item. In
the second version, it is improbable that the “not” would be overlooked by any
pupil because it is underlined and placed near the end of the stem.

4. All of the alternatives should be grammatically consistent with the stem of
the item. In the illustrative item in the previous section, note how a change in
the stem necessitated a change in the alternatives in order to maintain
grammatical consistency. This rule is not presented merely to perpetuate proper
grammar usage, however. Its main function is to prevent irrelevant clues from
creeping into the item. All too frequently the grammatical consistency of the
correct answer is given attention while that of the distracters is neglected. As a
result some of the alternatives are grammatically inconsistent with the stem and
are thereby obviously incorrect answers.


EXAMPLES
Poor: An electric transformer can be used
     A for storing up electricity.
     Ⓑ to increase the voltage of alternating current.
     C it converts electrical energy into mechanical energy.
     D alternating current is changed to direct current.
Better: An electric transformer can be used to
     A store up electricity.
     Ⓑ increase the voltage of alternating current.
     C convert electrical energy into mechanical energy.
     D change alternating current to direct current.


Similar difficulties arise from lack of attention to the tense of verbs, to the
proper use of the articles “a” and “an,” and to other common sources of
grammatical inconsistency. Since most of these errors are the result of carelessness,
they can be easily detected by a careful reading of each item before assembling
the items into a test.

5. An item should contain only one correct or clearly best answer. Including 
more than one correct answer in a test item and asking pupils to select all of 
the correct alternatives has two major shortcomings. First, such items are usually 
no more than a collection of true-false items presented in multiple-choice form. 
They do not present a definite problem in the stem and the selection of answers 
requires a mental response of true or false to each alternative rather than a com- 
parison and selection of alternatives. Second, since the number of alternatives 


selected as correct answers varies from one pupil to another there is no satis- 
factory method of scoring. 


EXAMPLES
Poor: The state of Michigan borders on
     Ⓐ Lake Huron.
     B Lake Ontario.
     Ⓒ Indiana.
     D Illinois.
Better: The state of Michigan borders on
     Ⓣ F   A Lake Huron.
     T Ⓕ   B Lake Ontario.
     Ⓣ F   C Indiana.
     T Ⓕ   D Illinois.


The second version of this item makes clear to the pupil the type of response
expected. He is to read each alternative and decide whether it is true or false.
Thus, this is not a four-alternative multiple-choice item. It is a series of four
statements each of which has two alternatives, true or false. This second
version, which is called a cluster-type true-false item, not only clarifies the nature
of the mental process involved but it also simplifies the scoring. Each statement
in the cluster can be considered one point and scored as any other true-false
item is scored. In contrast, how would you score a pupil who selected alternatives
A, B, and C in the first version? Would you give him two points because he
correctly identified the two answers; would you give him only one point because
he also selected one incorrect alternative; or would you give him no points
because he responded incorrectly to the item as a whole? How would you
evaluate his response to alternative D? Assume that he knew Illinois did not
border on Michigan and therefore did not select it, or assume that he was
uncertain and left it blank. There is no method of scoring which will
satisfactorily handle these problems. Multiple-response items, like the one in the
first version, should be avoided or converted to the true-false form.
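
The one-point-per-statement scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary KEY follows the answer key printed for the Michigan item, while the function name score_cluster and the idea of recording a pupil's paper as a dictionary are hypothetical choices made for the example.

```python
# Answer key for the cluster-type item, as keyed in the example above.
# True means the statement "The state of Michigan borders on ..." is keyed true.
KEY = {
    "Lake Huron": True,
    "Lake Ontario": False,
    "Indiana": True,
    "Illinois": False,
}


def score_cluster(responses, key=KEY):
    """Return one point for each statement the pupil marked as the key does.

    `responses` maps a statement to True or False. A statement the pupil
    left blank is simply absent from the mapping and earns no point.
    """
    return sum(
        1 for statement, truth in key.items()
        if responses.get(statement) == truth
    )


if __name__ == "__main__":
    # A pupil who marks Huron, Ontario, and Indiana true and Illinois false.
    paper = {"Lake Huron": True, "Lake Ontario": True,
             "Indiana": True, "Illinois": False}
    print(score_cluster(paper))
```

Scored this way, each of the four judgments is credited separately, and a blank is unambiguous: it simply earns no point. Neither is possible with the multiple-response form.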


There is another important facet of this rule concerning single-answer
multiple-choice items: the answer must be one that can be agreed upon by
authorities in the area. The best-answer type item is especially subject to
variations of interpretation and disagreement concerning the correct answer.
Care must be taken to be certain that the answer is clearly the best answer.
Frequently a rewording of the problem in the stem will correct an otherwise
faulty item.


EXAMPLES
Poor: Which one of the following is the best source of heat for home use?
     A Coal
     B Electricity
     C Gas
     D Oil
Better: In the [illegible] part of the United States, which one of the following
is the most [illegible] source of heat for home use?
     Ⓐ Coal
     B Electricity
     C Gas
     D Oil

In the first version of the item, several different alternatives could be
defended as correct depending on whether the “best” refers to cost, efficiency,
or accessibility. The second version avoids this problem by making the criterion
of “best” explicit.

6. Items used to measure understanding should contain some novelty but beware
of too much novelty. The construction of multiple-choice items which measure
learning outcomes at the understanding level requires a careful choice of
situations and skillful phrasing. The situations must be new to the pupils but
not too far removed from the illustrative examples used in class. If the test
items contain problem situations identical to those used in class, pupils can,
of course, respond on the basis of memorized answers. On the other hand, if
the problem situations contain too much novelty, some pupils may respond
incorrectly merely because they lack necessary factual information concerning
the situations used. Asking a pupil to apply the law of supply and demand to
some phase of banking, for example, would be grossly unfair if he had not had
a previous opportunity to study banking policies and practices. He may have a
good understanding of the law of supply and demand but be unable to
demonstrate his understanding because of his unfamiliarity with the particular
situation selected.

The problem of too much novelty can usually be avoided by selecting situations
from the everyday experiences of the pupils, by including in the stem of the
item any unique factual information needed, and by phrasing the item so that
the type of application or interpretation called for is clearly understood.

7. All distracters should be plausible. The purpose of a distracter is to
distract the uninformed away from the correct answer. To the pupil who has not
achieved the learning outcome being tested, the distracters should be at least as
attractive as the correct answer and preferably more so. In a properly
constructed multiple-choice item, each distracter will be selected by some pupils.
If a distracter is not selected by anyone, it makes no contribution to the
functioning of the item and should be eliminated or revised.
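
A quick tally of pupil responses shows which distracters are doing no work. The sketch below is a hypothetical illustration, not a procedure given in the text: the function name unused_distracters and the representation of responses as a list of chosen letters are assumptions made for the example.

```python
from collections import Counter


def unused_distracters(responses, alternatives, correct):
    """List the distracters that no pupil selected.

    responses    -- the letter each pupil chose for one item, e.g. ["C", "A", ...]
    alternatives -- all letters offered in the item, e.g. ["A", "B", "C", "D"]
    correct      -- the keyed answer, which is not a distracter
    """
    counts = Counter(responses)
    return [alt for alt in alternatives
            if alt != correct and counts[alt] == 0]


if __name__ == "__main__":
    chosen = ["C", "A", "C", "B", "C", "C", "A"]
    # Distracter D drew no responses, so it is a candidate for revision.
    print(unused_distracters(chosen, ["A", "B", "C", "D"], correct="C"))
```

With a small class, a distracter that draws no responses on a single administration is only a signal to look again, so the tally is best read alongside the teacher's judgment of the item.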

One factor contributing to the plausibility of distracters is their homogeneity.
If all of the alternatives are homogeneous with regard to the knowledge being
measured, there is much greater likelihood that the distracters will function as
intended. Whether alternatives appear homogeneous and distracters plausible,
however, also depends on the age level of the pupils. Note the difference in
homogeneity in the following two items.


EXAMPLES
Poor: Who discovered the North Pole?
     A Christopher Columbus
     B Ferdinand Magellan
     Ⓒ Robert Peary
     D Marco Polo
Better: Who discovered the North Pole?
     A Roald Amundsen
     B Richard Byrd
     Ⓒ Robert Peary
     D Robert Scott


The first version would probably appear homogeneous to pupils at the primary
level because all four choices are the names of well-known explorers. However,
pupils in higher grades would eliminate alternatives A, B, and D as possible
answers because they would know these men were not Polar explorers. They
might also recall that these men lived several hundred years before the North
Pole was discovered. In either case, they could quickly obtain the correct answer
by the process of elimination. The second version includes only the names of
Polar explorers, all of whom were active in Polar explorations at approximately
the same time. This homogeneity makes each alternative much more plausible
and the elimination process much less effective. It, of course, also increases the
level of difficulty of the item.


In selecting plausible distracters, the learning experiences of the pupils must
not be ignored. In the above illustrative item, for example, the distracters in
the second version would not be plausible to pupils if Robert Peary was the only
Polar explorer they had studied. Obviously, distracters must be familiar to
pupils before they can serve as reasonable alternatives. Less obvious is the rich
supply of plausible distracters provided by the pupils’ learning experiences.

Misconceptions, errors of judgment, and faulty reasoning that occur during the
teaching-learning process provide the most plausible and educationally sound
distracters available. One way to tap this supply is to keep a running record of
such errors. A quicker method is to administer a short-answer test to pupils and
tabulate the errors which occur most frequently. This provides a series of
incorrect responses which are especially plausible because they reflect errors
pupils actually make.

8. Verbal associations between the stem and the correct answer should be
avoided. Frequently a word in the correct answer will provide an irrelevant clue
because it looks or sounds like a word in the stem of the item. Such verbal
associations should never permit the pupil who lacks the necessary achievement
to select the correct answer. However, words similar to those in the stem might
be included in the distracters to increase their plausibility. Pupils who depend
on rote memory and verbal associations will then be led away from, rather than
to, the correct answer. The following item, taken from a fifth-grade test on a
weather unit, illustrates the incorrect and correct use of verbal associations
between the stem and the alternatives.


EXAMPLES
Poor: Which one of the following agencies should you contact to find out about a
tornado warning in your locality?
     A State Farm Bureau
     Ⓑ Local Radio Station
     C United States Post Office
     D United States Weather Bureau
Better: Which one of the following agencies should you contact to find out about a
tornado warning in your locality?
     A Local Farm Bureau
     Ⓑ Nearest Radio Station
     C Local Post Office
     D United States Weather Bureau

In the first version of the item, the association between “locality” and “local”
provides an unnecessary clue. In the second version, this verbal association is
used in two distracters to make them more attractive choices. It should be noted
that if this use of irrelevant verbal associations in the distracters is overdone,
pupils will soon get wise and avoid alternatives with pat verbal associations.


9. The relative length of the alternatives should not provide a clue to the
answer. Since the correct answer usually needs to be qualified, it tends to be
longer than the distracters unless a special effort is made to control the relative
length of the alternatives. Where the correct answer cannot be shortened, the
distracters can be expanded to the desired length. Lengthening the distracters is
desirable for another reason also: the added qualifiers and increased specificity
frequently contribute to their plausibility.


EXAMPLES
Poor: What is the major purpose of the United Nations?
     Ⓐ To maintain peace among the peoples of the world.
     B To establish international law.
     C To provide military control.
     D To form new governments.
Better: What is the major purpose of the United Nations?
     Ⓐ To maintain peace among the peoples of the world.
     B To develop a new system of international law.
     C To provide military control of nations which have recently attained their
       independence.
     D To establish and maintain democratic forms of government in newly formed
       nations.


The best we can hope for in equalizing the length of the alternatives for a 
given test item is to make them approximately equal. Consequently, we still have 
the problem of the length of the correct answer. Although it should not be 
consistently longer than the other alternatives, neither should it be consistently 
shorter nor consistently of median length. The relative length of the correct 
answer should vary from one item to another in such a manner that no dis- 
cernible pattern is available to provide a clue to the answer. 
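
One way to look for such a pattern across a finished test is to classify, item by item, whether the keyed answer is the longest alternative, the shortest, or somewhere between, and then count the results. The sketch below is an assumption-laden illustration: the function names length_rank and length_pattern are hypothetical, length is measured crudely as the number of characters, and each item is assumed to have at least two alternatives.

```python
from collections import Counter


def length_rank(alternatives, correct_index):
    """Classify the keyed answer as 'longest', 'shortest', or 'middle'.

    Length is measured in characters, a rough stand-in for how long an
    alternative looks on the page. Ties count as 'middle'.
    """
    lengths = [len(text) for text in alternatives]
    own = lengths[correct_index]
    others = lengths[:correct_index] + lengths[correct_index + 1:]
    if own > max(others):
        return "longest"
    if own < min(others):
        return "shortest"
    return "middle"


def length_pattern(items):
    """Tally the classifications over a whole test.

    `items` is a list of (alternatives, correct_index) pairs.
    """
    return Counter(length_rank(alts, idx) for alts, idx in items)


if __name__ == "__main__":
    poor = (["To maintain peace among the peoples of the world.",
             "To establish international law.",
             "To provide military control.",
             "To form new governments."], 0)
    print(length_pattern([poor]))
```

A tally dominated by any one category, whether "longest", "shortest", or "middle", suggests that the relative length of the keyed answer is giving pupils a clue.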

10. The correct answer should appear in each of the alternative positions 
approximately an equal number of times, but in random order. Some teachers 
seem to have a tendency to bury the correct answer in the middle of the list of 
alternatives. As a consequence, the correct answer appears in the first and last 
positions far less than it does in the middle positions. This, of course, provides 
an irrelevant clue to the alert pupil. 

In placing the correct answer in each position approximately an equal number 
of times, care must be taken to avoid a regular pattern of responses. A random 
placement of correct answers is easily attained with the use of any book. For 
each test item, open the book at an arbitrary position, note the number on the 
right-hand page, and place the correct answer for that test item as follows: 


If page number      Place correct
   ends in             answer
      1                First
      3                Second
      5                Third
      7                Fourth
      9                Fifth

This random placement of the correct answer should be used for all
multiple-choice items, with the possible exception of some numerical answers.
Numbers should be placed in numerical order and this sometimes dictates where
the correct answer must fall.
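
The page-number procedure can be mimicked in Python for readers who assemble tests on a computer. The mapping from final digit to position is the one in the table above; everything else is an assumption made for illustration, including the function names, the 400-page book, and the choice to draw again when the indicated position does not exist (for example, "Fifth" in a four-alternative item), a case the procedure above does not address.

```python
import random

# Final digit of the right-hand (odd) page number -> position of the answer.
POSITION_FOR_DIGIT = {1: 1, 3: 2, 5: 3, 7: 4, 9: 5}


def position_from_page(page_number, n_alternatives=5):
    """Apply the table to one page number.

    Returns the position (1 = first alternative), or None when the digit is
    not in the table or the position exceeds the number of alternatives.
    """
    position = POSITION_FOR_DIGIT.get(page_number % 10)
    if position is None or position > n_alternatives:
        return None
    return position


def random_position(n_alternatives, rng=random, last_page=399):
    """Open a hypothetical book at random until a usable position turns up."""
    right_hand_pages = range(1, last_page + 1, 2)  # odd page numbers only
    while True:
        position = position_from_page(rng.choice(right_hand_pages),
                                      n_alternatives)
        if position is not None:
            return position


if __name__ == "__main__":
    # Positions of the keyed answer for a ten-item, four-alternative test.
    print([random_position(4) for _ in range(10)])
```

Because the draws are random, a short test may still show an uneven spread of positions, so the finished key should be checked against the rule that each position is used approximately an equal number of times.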

11. Use special alternatives such as “none of the above” or “all of the
above” sparingly. The phrase “none of the above” or “all of the above” is
sometimes added as the last alternative in multiple-choice items. This is done to
force the pupil to consider all of the alternatives carefully and to increase the
difficulty of the items. All too frequently, however, these special alternatives are
used inappropriately. In fact, their limitations are such that there are relatively
few situations in which their use is appropriate.

The use of “none of the above” is restricted to the correct-answer type of
multiple-choice item and consequently to the measurement of factual knowledge
to which absolute standards of correctness can be applied. It is clearly
inappropriate in best-answer type items, since the pupil is told to select the
best of several alternatives of varying degrees of correctness.

Use of “none of the above” is frequently recommended for items measuring
computational skill in mathematics, and spelling ability. These learning
outcomes should generally not be measured by multiple-choice items, however,
since they can be measured much more effectively by short-answer items. Where
“none of the above” is used in such situations, the item may be measuring
nothing more than a pupil’s ability to recognize incorrect answers. This is a
rather inadequate basis for judging his computational skill or spelling ability.

The alternative “none of the above” should be used only when the measurement
of significant learning outcomes requires it. As with negatively stated item
stems, there are situations where procedures or practices are to be clearly
avoided for safety, health, or other reasons. Where knowing what not to do is
important, “none of the above” might be appropriately applied. When used for
this purpose, it must, of course, also be used as an incorrect answer a
proportionate number of times.

The use of “all of the above” is fraught with such difficulties that it might
better be discarded as a possible alternative. When used, some pupils will note
that the first alternative is correct and select it without reading further. Other
pupils will note that at least two of the alternatives are correct and thereby
know that “all of the above” must be the answer. In the first instance, pupils
mark the item incorrectly because they did not read all of the alternatives. In
the second instance, pupils obtain the correct answer on the basis of partial
knowledge. Both types of response prevent the item from functioning as intended.

12. Do not use multiple-choice items where other item types are more
appropriate. Where various item types can serve a purpose equally well, the
multiple-choice item should definitely be favored because of its many superior
qualities. There are situations, however, where the multiple-choice form is
inappropriate or less suitable than other item types. In certain problem-solving
situations in mathematics and science, for example, supply-type short-answer
items are clearly superior. Where there are only two possible responses (e.g.,
fact or opinion), the alternative-response item is more appropriate. Where there
is a sufficient number of homogeneous items but few plausible distracters for
each, a matching exercise might be more suitable. Although we should take full
advantage of the wide applicability of the multiple-choice form, we should not
lose sight of a basic principle of test construction cited earlier; that is, select
the item type which measures a learning outcome most directly and most
effectively.



SUMMARY 


The multiple-choice item consists of a problem and list of alternative
solutions. The pupil responds by selecting the alternative which provides the
correct or best solution to the problem. The incorrect alternatives are called
distracters, since their purpose is to distract the uninformed pupil from the
correct response. The problem can be stated as a direct question or an
incomplete statement. In either case, it should be a clearly formulated problem
which is meaningful without reference to the list of alternatives.

The multiple-choice form is extremely flexible. It can be used to measure a
variety of learning outcomes at the knowledge and understanding levels.
Knowledge outcomes concerned with vocabulary, specific facts, principles, and
methods and procedures can all be measured with the multiple-choice item.
Aspects of understanding, such as the application and interpretation of facts,
principles, and methods, can also be measured with this item type. Many other,
more specific, uses occur in particular school subjects.

The main advantage of the multiple-choice item is its wide applicability in
the measurement of various phases of achievement. It is also free of many of
the limitations of other forms of objective items. It tends to present a more
well-defined problem than the short-answer item, it avoids the need for
homogeneous material required by the matching item, and it reduces the clues and
susceptibility to guessing which are characteristic of the true-false item. In
addition, the multiple-choice item is relatively free from response sets and is
useful in diagnosis.

Its limitations derive mainly from the fact that it is a selection-type
paper-and-pencil test. It measures problem-solving behavior at the verbal level
only. Since it requires selection of the correct answer, it is inappropriate for
measuring learning outcomes requiring the ability to recall, organize, or present
ideas.

The construction of multiple-choice items involves the formulation of a
well-defined problem in the stem of the item, the selection of one correct or
clearly best solution, the identification of several plausible distracters, and the
avoidance of irrelevant clues to the answer. Items used to measure learning
outcomes at the understanding level must also include some (but beware of too
much) novelty.


SUGGESTIONS FOR FURTHER READING 


Furst, E. J. Constructing Evaluation Instruments. New York: Longmans, Green, 1958.
Chapter 10: “Constructing Choice-Type Items.”

Lindvall, C. M. Testing and Evaluation: An Introduction. New York: Harcourt, Brace &
World, 1961. Chapter 7: “Constructing Tests for Use in the Elementary School.”
Chapter 8: “Constructing Tests for Use in the Secondary School.”

Travers, R. M. W. How to Make Achievement Tests. New York: Odyssey Press, 1950. Chap- 
ter 4: “Objective-Type Test Questions: Best-Answer or Multiple-Choice.” 

Wood, Dorothy A. Test Construction. Columbus, Ohio: Charles E. Merrill Books, 1960. 
Chapter 7: “Constructing Objective Test Items.” 


Illustrative Test Items

Anderson, H. R., E. F. Lindquist, and H. D. Berg. Selected Test Items in American
Government. Revised edition, National Council for the Social Studies, Washington,
D.C.: National Education Association, 1950.

Anderson, H. R., E. F. Lindquist, and H. Stull. Selected Test Items in American History.
Revised edition, National Council for the Social Studies, Washington, D.C.: National
Education Association, 1957.

Anderson, H. R., E. F. Lindquist, and D. K. Heenan. Selected Test Items in World History.
Revised edition, National Council for the Social Studies, Washington, D.C.: National
Education Association, 1960.

Bloom, B. S. (ed.) Taxonomy of Educational Objectives: Handbook I, Cognitive Domain.
New York: Longmans, Green, 1956.

Gerberich, J. R. Specimen Objective Test Items: A Guide to Achievement Test Construction.
New York: Longmans, Green, 1956.


measuring complex 
achievement: the 
interpretive exercise 


Complex achievement includes those learning outcomes based on the
higher mental processes, such as . . . understandings . . . thinking skills
. . . and various problem solving abilities. . . . Many aspects of complex
achievement can be measured objectively.


We have already had some experience with the measurement of complex
achievement, since this category encompasses all those learning outcomes
requiring more than mere retention of factual knowledge. The use of the
short-answer item to measure problem-solving abilities in mathematics and
science, of the true-false item to measure the ability to recognize
cause-and-effect relationships, and of the multiple-choice item to measure
various aspects of understanding, all illustrated the measurement of complex
achievement. These illustrations, however, were limited to the use of single,
independent test items of the objective type. Greater range and flexibility in
measuring complex achievement can be attained by using more complex forms of
objective test items.

A variety of specific learning outcomes are included in complex achievement.
Following are some typical examples:
Ability to apply a principle
Ability to interpret relationships
Ability to recognize and state inferences
Ability to recognize the relevance of information
Ability to develop and recognize tenable hypotheses
Ability to formulate and recognize valid conclusions
Ability to recognize assumptions underlying conclusions
Ability to recognize the limitations of data
Ability to recognize and state significant problems
Ability to design experimental procedures


These and similar learning outcomes have been variously classified under such
general headings as understanding, reasoning, critical thinking, scientific
thinking, creative thinking, and problem solving. There is general agreement that
learning outcomes based on the higher mental processes constitute some of the
most significant outcomes of education.¹ In the past, the measurement of complex
achievement was relegated almost entirely to the essay test. However, during the
Eight-Year Study, a number of objective tests for measuring complex learning
outcomes were developed under the direction of R. W. Tyler.² Since that time
numerous modifications and adaptations of them have appeared, many of them to
meet specific and limited purposes. The most promising form for measuring a
variety of complex learning outcomes, in most school subjects, is the
interpretive exercise.


NATURE OF THE INTERPRETIVE EXERCISE³


An interpretive exercise consists of a series of objective items based on a
common set of data. The data may be in the form of written materials, tables,
charts, graphs, maps, or pictures. The series of related test items may also take
various forms but are most commonly of the multiple-choice or
alternative-response variety. Since all pupils are presented with a common set of
data, it is possible to measure a variety of complex learning outcomes. Pupils
can be asked to identify relationships in data, to recognize valid conclusions, to
appraise assumptions and inferences, to detect proper applications of data, and
the like.

The common set of materials used in interpretive exercises provides assurance
that all pupils are confronted with the same task. It also makes it possible
to control the amount of factual information given to the pupils. We can give
them as much or as little information as we think desirable in measuring their
achievement of a specific learning outcome. In measuring the ability to interpret
mathematical data, for example, we can include the formulas needed or require
the pupils to supply them. In other areas, we can provide definitions of terms,
meanings of symbols, and other specific facts or we can expect pupils to provide
them. This flexibility makes it possible to measure various degrees of
proficiency in any particular area.


FORMS AND USES OF THE INTERPRETIVE EXERCISE


As with other objective items, there are so many forms and specific uses of
the interpretive exercise that it is impossible to illustrate all of them. What we
shall do here is present representative examples of this item type as applied to
the measurement of complex learning outcomes in a variety of school subjects
at the elementary and secondary levels. Different types of introductory material
and different methods of responding will also be used, to illustrate the
great flexibility of the interpretive exercise. The references at the end of this
chapter will provide additional illustrative exercises for your guidance.

¹ D. H. Russell, “Higher Mental Processes,” Encyclopedia of Educational Research
(3rd edition, New York: Macmillan, 1960).

² E. R. Smith and R. W. Tyler, Appraising and Recording Student Progress (New York: ...).

R. L. Ebel, “Writing the Test Item,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).

³ Variations of this item type are also called “the classification exercise,”
“key-type items,” and “master-list items.”


Ability to Recognize the Relevance of Information 


This learning outcome is important in all subject-matter areas and can be 
measured at all levels of instruction. The illustrative exercise presented below 
was prepared for third-grade pupils. An example at the high school level may 
be found in Chapter 3 (page 54). 


EXAMPLE
Bill lost his overshoe on the way to school. He wanted to put a notice on the bulletin board
so other children could help him find it. Which of the following sentences tell something
that would help children find the overshoe?
Directions: Circle yes if it would help. Circle no if it would not help.
     (yes) no    1. The overshoe was black.
     yes (no)    2. It was very warm.
     (yes) no    3. It was for his right foot.
     yes (no)    4. It was a Christmas present.
     yes (no)    5. It was nice looking.
     (yes) no    6. It had a zipper.
     (yes) no    7. It had a grey lining.


Ability to Apply Principles 


The application of principles may be shown in many different ways. In the
following examples, pupils are asked to identify principles that explain a situa-
tion and to recognize illustrations of a principle.


EXAMPLE I 


Mary Ann wanted her rose bush to grow faster so she applied twice as much chemical
fertilizer as was recommended, and watered it every evening. About a month later she noticed
that the rose bush was dying.

Directions: Which of the following principles would be necessary in explaining why the
rose bush was dying? If a principle is necessary, circle the “N”; if unnecessary, circle the
“U.”

 N  (U)  1. A chemical compound is changed into other compounds by taking up the ele-
            ments of water.
(N)  U   2. Semipermeable membranes permit the passage of fluid.
 N  (U)  3. Water condenses when cooled.
(N)  U   4. When two solutions of different concentration are separated by a porous partition
            their concentration tends to equalize.
 N  (U)  5. In the leaves of plants, carbon dioxide and water of the air combine under the
            influence of light to form carbohydrates.


EXAMPLE II 
Directions: Read the principle and the statements following it. If a statement describes a
condition which illustrates the principle, place a check (X) in the space to the left of the
statement.


Principle: If the demand for a commodity or service is relatively constant, a decrease in its
supply will increase its market value.

( ) 1. The stock market has shown a general upward trend in the price of stocks since
       World War II.
(X) 2. Fresh fruits and vegetables cost more when not in season.
( ) 3. Medical costs are higher now than they were ten years ago.
(X) 4. The price of automobiles increased rapidly during World War II.


Ability to Recognize Warranted and Unwarranted Generalizations


This learning outcome is of central importance in the interpretation of data.
Pupils should be able to determine which conclusions are supported by the
data, which are refuted by the data, and which are neither supported nor
refuted by the data.


EXAMPLE


MORTALITY OF WHITE PERSONS FROM MOTOR VEHICLE
ACCIDENTS IN THE UNITED STATES, 1957-58

Age Period          Death Rate per 100,000
(Years)             Males        Females

All ages            32.9         11.1
1-4                 10.5          8.0
5-14                10.4          5.4
15-19               54.2         16.4
20-24               76.3         12.7
25-44               35.6          9.1
45-64               33.1         12.9
65 and over         58.4         22.5

Source of basic data: Statistical Bulletin, Metropolitan Life Insurance Company, Vol. 42,
February, 1961.
Directions: The following statements refer to the data in the above table. Read each state-
ment and mark your answer according to the following key.
Circle:  S if the statement is supported by the data in the table.
         R if the statement is refuted by the data in the table.
         N if the statement is neither supported nor refuted by the data in the table.

(S)  R   N   1. The death rate from motor vehicle accidents is higher for men than for
                women.
 S   R  (N)  2. Motor vehicle accidents are a major cause of death among young men be-
                tween the ages of 20 and 24.
 S   R  (N)  3. Men over 65 years of age drive more safely than teen age boys between
                15 and 19 years of age.
 S  (R)  N   4. The largest number of people killed in motor vehicle accidents are 65 years
                of age or over.
 S  (R)  N   5. When all ages are combined, only a … attributed to motor vehicle accidents.
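A keyed statement of this kind can be checked mechanically against the table. The following is a minimal sketch, not part of the original exercise: it assumes the table is stored as a dictionary, the names `RATES`, `males_higher_everywhere`, and `highest_male_group` are hypothetical, and the "1-4" label is a reading of a partly illegible row heading.

```python
# Death rates per 100,000 from the table above, stored as (males, females).
# The "1-4" label is an assumed reading of a partly illegible row heading.
RATES = {
    "All ages": (32.9, 11.1),
    "1-4": (10.5, 8.0),
    "5-14": (10.4, 5.4),
    "15-19": (54.2, 16.4),
    "20-24": (76.3, 12.7),
    "25-44": (35.6, 9.1),
    "45-64": (33.1, 12.9),
    "65 and over": (58.4, 22.5),
}


def males_higher_everywhere(rates):
    """True if the male rate exceeds the female rate in every row."""
    return all(male > female for male, female in rates.values())


def highest_male_group(rates):
    """Age group with the highest male death rate, ignoring the summary row."""
    groups = {age: pair for age, pair in rates.items() if age != "All ages"}
    return max(groups, key=lambda age: groups[age][0])


if __name__ == "__main__":
    print(males_higher_everywhere(RATES))
    print(highest_male_group(RATES))
```

The first check corresponds to statement 1. Note that the table reports rates only; a statement about the number of persons killed would also require population figures, which the table does not give.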


Ability to Recognize Assumptions


Another learning outcome pertinent to the interpretation of various types of
data is the ability to identify unstated assumptions which are necessary to a
conclusion or course of action.


EXAMPLE I⁵


Statement of Facts: The following table represents the relationship between the yearly in-
come of certain families and the medical attention they receive.

                              Per Cent of Family
                              Members Who Received
                              No Medical Attention
Family Income                 During the Year

Under $1,200 ................ 47
$1,200 to $3,000 ............ 40
$3,000 to $5,000 ............ 33
$5,000 to $10,000 ........... 24
Over $10,000 ................ …

Conclusion: Members of families with small incomes are healthier than members of families
with large incomes.

Assumption: Which one of the following must be assumed to make the above conclusion
true? Check one.

____ 1. Wealthy families had more money to spend for medical care.
__x_ 2. All members of families who needed medical attention received it.
____ 3. Many members of families with low incomes were not able to pay their doctor
        bills.
____ 4. Members of families with low incomes often did not receive medical attention.


EXAMPLE II⁶


Items 37-43 are concerned with the following situation.

One of the methods formerly used by geologists to determine the age of the earth was a
calculation based on the amount of salt (NaCl) in the ocean, and the amount added to the
ocean waters each year by the rivers that empty into the ocean. If this method of age de-
termination is used, certain assumptions must be made.

Items 37-43 consist of a number of assumptions. The assumption in each item is …

KEY: 1. Necessary for the calculation and is probably true.
     2. Necessary for the calculation but is probably false.
     3. Not necessary for the calculation but is probably true.
     4. Not necessary for the calculation and is probably false.

37. The salt concentration of the oceans is gradually increasing. (1)
38. Oceans have been on the earth since our planet was formed. (2)
39. Ever since its origin, the earth has revolved around the sun. (3)
40. The oceans now contain all the salt that has ever been added to them. (2)


41. The salts which the rivers have carried to the oceans have all occurred in mineral form in
    … they were dissolved by the river water. (4)
42. The proportion of the lithosphere existing above the ocean waters has been constant
    throughout the geologic ages. (2)
43. The continental masses have existed in essentially their present outline since the forma-
    tion of the earth. (2)


⁵ Louis M. Heil, et al., “The Measurement of Understanding in Science,” Chapter VI in
The Measurement of Understanding, page 127, National Society for the Study of Education,
45th Yearbook, Part I. Copyright, 1946, by Nelson B. Henry, Secretary of the Society. Used
by permission of the publisher, University of Chicago Press.

⁶ Clarence H. Nelson, Let's Build Quality Into Our Science Tests, page 9. Copyright ©
1958 by the National Science Teachers Association (Washington, D.C.: National Education
Association). Used by permission of the publisher.


Ability to Recognize Inferences


In interpreting written material it is frequently necessary to draw inferences
from the facts given. The following exercise measures the extent to which pupils
are able to recognize warranted and unwarranted inferences drawn from a
passage.

EXAMPLE⁷

Directions: Assuming that the information below is true, it is possible to establish other
facts using the ones in this paragraph as a basis for reasoning. This is called drawing infer-
ences. There is, of course, a limit to the number of kinds of facts which may be properly
inferred from any statement.

By writing the proper symbol in the space provided, indicate that a statement is TRUE,
if it may be properly inferred from the information given in the paragraph. Indicate that it
is UNTRUE, if the information given in the paragraph implies that it is false. Indicate that
NO INFERENCE can be drawn if the statement cannot be inferred one way or the other.
Use only the information given in the paragraph as a basis for your responses. . . .

Use the following symbols in writing your answers:
T - if the statement may be inferred as TRUE.
F - if the statement may be inferred as UNTRUE.
N - if no inference can be drawn about it from the paragraph.


Paragraph A

By the close of the thirteenth century there were several famous universities established
in Europe, though of course they were very different from modern ones. One of the earliest
to be founded was one of the most widely known. This was the University of Bologna, where
students from all countries came who wished to have the best training in studying Roman
Law. Students especially interested in philosophy and theology went to the University of
Paris. Those who wished to study medicine went to the Universities of Montpelier or
Salerno.


Questions on Paragraph A

( )  1. There were law suits between people occasionally in those days.
( )  2. The professors were poorly paid.
(F)  3. In the Middle Ages people were not interested in getting education.
( )  4. There were books in Europe at that time.
(N)  5. Most of the teaching in … universities was …
( )  6. There was no place where students could go to study.
(F)  7. There were no doctors in Europe at this time.
(F)  8. There was no way to travel during the Middle Ages.
(T)  9. If a student wanted to be a priest, he would probably attend the University of
        Paris.
(N) 10. There were no universities in Europe before the thirteenth century.
(N) 11. There was only one language in Europe at this time.


⁷ Horace T. Morse and George H. McCune, Selected Items for the Testing of Study Skills
and Critical Thinking, page …, … Edition. Copyright … by the National
Council for the Social Studies (Washington, D.C.: National Education Association).
Used by permission of the publisher.




Ability to Interpret Experimental Findings 


To determine the extent to which pupils understand scientific methodology,
Nelson⁸ has suggested the use of interpretative exercises based on classical ex-
periments. The following example is one he based on Francesco Redi’s study of
spontaneous generation.


EXAMPLE⁹


Items 20-24 are based upon the following situation.

PROBLEM: How do the simpler living organisms originate?

HYPOTHESIS I: Flies may be produced by spontaneous generation from dead organic sub-
stances.

EXPERIMENTAL TEST OF HYPOTHESIS I:
Redi, an Italian physician and scientist, put pieces of fresh meat inside of one set of jars
which he immediately sealed with parchment (represented by Jar A in the Figure below).
Inside of another set of jars (represented by Jar B in the Figure below) he also put
pieces of fresh meat, but these jars he left open. Later he observed flies entering and leaving
the open jars at will. No flies could enter the closed jars. Some days later the meat in the
open jars teemed with maggots, but no maggots developed in the meat inside of the closed
jars.

As a refinement in procedure, he repeated the above experiment but instead of sealing
one set of jars with parchment he now closed them with fine-meshed gauzelike Naples veil-
ing (represented by Jar C in the Figure below). Flies, attracted by the odor of the meat
inside the jars, frequently alighted on the veiling, occasionally depositing eggs on the veiling.
These eggs soon hatched into maggots on top of the veiling, but no maggots developed in
the meat inside the jars.


[Figure: Jars A, B, and C]


For items 20-23 mark space
1 - if the item is true according to the data and tends to support Hypothesis I;
2 - if the item is true according to the data but tends to refute Hypothesis I;
3 - if the item is irrelevant to Hypothesis I, regardless of its truth or falsity according to
    the data;
4 - if the item is false according to the data, but if true, would tend to support Hy-
    pothesis I;
5 - if the item is false according to the data, but if true, would tend to refute Hy-
    pothesis I.


⁸ C. H. Nelson, Let’s Build Quality into Our Science Tests (Washington, D.C.: National
Science Teachers Association, 1958).

⁹ Clarence H. Nelson, Let’s Build Quality Into Our Science Tests, page 13. Copyright ©
1958 by the National Science Teachers Association (Washington, D.C.: National Education
Association). Used by permission of the publisher.




SUGGESTION: It will be easier if you decide first whether each statement is true or false
according to the data.

20. On the basis of what happened in all three jars, no maggots were to be seen in the meat
    in Jar A because they suffocated in this tightly closed jar. (4)
21. Jar B, which was the only jar that the flies could enter, was also the only jar in which
    maggots appeared in the meat. (2)
22. The maggots which appeared on top of the veiling of Jar C appeared there only because
    tiny particles of decaying meat that were carried upward through the veiling by the
    rising air turned into maggots. (4)
23. Maggots were not found in the meat in Jar C for the same reason that they were not
    found in Jar A. (2)
24. On the basis of Redi’s data alone, what is the status of Hypothesis I?
    1. It is established as true beyond doubt.
    2. It is probably true; the evidence tends to support it.
    3. It remains as much unsettled as at the outset.
    4. It is probably false; the evidence tends to refute it.
    5. It is definitely false without any doubt whatsoever.
Use of Pictorial Materials


Pictorial materials can serve two useful purposes in interpretive exercises.
(1) They can serve as a medium for measuring a variety of learning outcomes
similar to those already discussed. It is simply a matter of replacing the written
or tabular data with some pictorial form of presentation. This use is especially
desirable with younger pupils and in areas where the ideas can be more clearly
conveyed in pictorial form. (2) They can also serve as a direct measure of the
ability to interpret graphs, cartoons, maps, and other pictorial materials. In
many school subjects, these are important learning outcomes in their own right.
The following examples illustrate a few of the many uses of pictorial mate-
rials. The circle graph was intentionally kept simple to show that some pic-
torial forms can be easily constructed.


EXAMPLE I


[Circle graph: how Bill spends his weekly allowance. Legible segment label: SCHOOL.]

The graph at the left represents the way Bill spends his weekly allowance of $2. Answer
the following questions, based on the graph, by circling the letter of the correct answer.

1. What is the ratio of the amount Bill spends for school supplies to the amount he spends
   for movies?
    A  7:2
   (B) 1:3
    C  2:7
    D  3:1
2. What would be the best title for this graph?
    A  Bill’s weekly allowance
    B  Bill’s money graph
   (C) Bill’s weekly expenditures
    D  Bill’s money planning


EXAMPLE II¹⁰


1. The cartoon illustrates which of the following characteristics of the party system in the
   United States?
   (A) Strong party discipline is often lacking.
    B  The parties are responsive to the will of the voters.
    C  The parties are often more concerned with politics than with the national welfare.
    D  Bipartisanship often exists in name only.
2. The situation shown in the cartoon is least likely to occur at which of the following times?
    A  During the first session of a new Congress
    B  During a political party convention
    C  During a primary election campaign
   (D) During a presidential election campaign


¹⁰ Educational Testing Service, Making the Classroom Test: A Guide for Teachers, page
18, Evaluation and Advisory Service Series, No. 4. Copyright … by Educational Testing
Service (Princeton, New Jersey). Used by permission of the publisher.




EXAMPLE III¹¹

In the following questions you are asked to make inferences from the data which are given
on the map of the imaginary country, Serendip. The answers in most instances must be
probabilities rather than certainties. The relative size of towns and cities is not shown. To
assist you in the location of the places mentioned in the questions, the map is divided into
squares lettered vertically from A to E and numbered horizontally from 1 to 5.


[Map: SERENDIP. Contour interval is 1000 feet.]


Which of the following cities would be the best location for a steel mill?
   A  Li (3A)
   B  Um (3B)
   C  Cot (3D)
   D  Dube (4B)


¹¹ Educational Testing Service, Multiple-Choice Questions: A Close Look, page 12. Copy-
right … by Educational Testing Service (Princeton, New Jersey). Used by permission
of the publisher.




EXAMPLE IV¹²


This question is based on the following situation:
A piece of mineral is placed in a bottle half-filled with a colorless liquid. A two-holed
rubber stopper is then placed in the bottle. The system is then sealed by inserting a ther-
mometer and connecting a glass tube to the stoppered bottle and a beaker of limewater as
shown in the accompanying diagram:


[Diagram labels: thermometer, colorless liquid, mineral, limewater]


The following series of observations is recorded:

I. Observations during the first few minutes:
   1. Bubbles of a colorless gas rise to the top of the stoppered bottle from the mineral.
   2. Bubbles of colorless gas begin to come out of the glass tube and rise to the surface
      of the limewater.
   3. The limewater remains colorless throughout this period of time.
   4. The thermometer reads 20°C.

II. Observations at the end of thirty minutes:
   1. Bubbles of colorless gas continue to rise in the stoppered bottle.
   2. The piece of mineral has become noticeably smaller.
   3. There is no apparent change in the level of the colorless liquid in the bottle.
   4. The colorless liquid in the bottle remains colorless.
   5. The thermometer reads 24°C.
   6. The limewater is cloudy.

Which one of the following is the best explanation for the appearance of gas bubbles at the
end of the tube in the beaker of limewater?

    A  The pressure exerted by the colorless liquid is greater than that exerted by the limewater.
   (B) The bubbles coming from the mineral cause an increased gas pressure in the stoppered
       bottle.
    C  The temperature increase at the end of thirty minutes causes an expansion of gas in the
       stoppered bottle.
    D  The decrease in the size of the piece of mineral causes reduced pressure in the stoppered
       bottle.
    E  The glass tube serves as a siphon for the flow of gas from the bottle to the beaker.


¹² Educational Testing Service, Multiple-Choice Questions: A Close Look, page 36. Copy-
right … by Educational Testing Service (Princeton, New Jersey). Used by permission
of the publisher.




EXAMPLE V¹³


[Map: DISTRIBUTION OF SCIENTISTS BY U.S. GEOGRAPHIC DIVISION, 1960.
Legible labels: New England 7%; Middle Atlantic 22%; Mountain 6%; … areas and no
report 3%. Total scientists included: 201,292.
Note: Hawaii and Alaska are in the Pacific division.
Source: National Register of Scientific and Technical Personnel, 1960.]


Directions: The following statements refer to the data in the above map. Read each state-
ment and mark your answer according to the following key.
Circle:  T: if the data in the map is sufficient to make the statement true.
         F: if the data in the map is sufficient to make the statement false.
         I: if the data in the map is insufficient to determine whether the statement is true
            or false.

(T)  F   I   1. The South Atlantic division has more than twice as many scientists as the West
                North Central division.
 T   F  (I)  2. More scientists are trained in the Middle Atlantic division than in any other
                division.
 T   F  (I)  3. The number of scientists is increasing more rapidly in the Eastern divisions
                than the Western divisions.
 T   F   I   4. There are more than 6,000 scientists in the East South Central division.
 T   F   I   5. There are fewer scientists per square mile in the New England division than in
                any other division.
 T   F  (I)  6. There is less need for scientists in the Mountain division than in the Pacific
                division.


¹³ The map on which this item is based was reproduced from Scientific Manpower Bul-
letin, No. 17, National Science Foundation (Washington 25, D.C.: April, 1962). Used by
permission of the publisher.
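Statements such as item 4 turn on simple arithmetic with the map's percentages and its stated total. The sketch below illustrates that arithmetic under stated assumptions: only the three division percentages still legible here are used, the function names `approx_count` and `min_percent_for` are hypothetical, and because the map's percentages are rounded the counts are approximate.

```python
# Total stated on the map, and the division shares (per cent) still legible.
TOTAL_SCIENTISTS = 201_292
LEGIBLE_SHARES = {"New England": 7, "Middle Atlantic": 22, "Mountain": 6}


def approx_count(percent, total=TOTAL_SCIENTISTS):
    """Approximate number of scientists represented by a rounded per cent."""
    return round(total * percent / 100)


def min_percent_for(count, total=TOTAL_SCIENTISTS):
    """Per cent of the total needed for a division to reach `count` scientists."""
    return 100 * count / total


if __name__ == "__main__":
    for division, share in LEGIBLE_SHARES.items():
        print(division, approx_count(share))
    # A division needs just under 3 per cent of the total to exceed 6,000.
    print(round(min_percent_for(6_000), 2))
```

A pupil answering item 4 would compare the East South Central percentage shown on the map with the threshold computed by `min_percent_for(6_000)`.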




ADVANTAGES AND LIMITATIONS OF INTERPRETIVE EXERCISES 


The interpretive exercise has several advantages over the single, independent
objective test item. First, the introductory material makes it possible to meas-
ure an ultimate objective of education and one of increasing importance in all
school subjects, namely, the ability to interpret written materials, charts, graphs,
maps, pictures, and other communication media encountered in everyday situ-
ations. The rapid expansion of knowledge in every subject-matter area has
made it impossible to learn all of the important factual information in any given
field. This has led to greater dependence on libraries, reference materials, self-
study techniques, and consequently interpretive skills. Second, the interpretive
exercise makes it possible to measure more complex learning outcomes than
can be measured with the single objective item. Some data, such as that pre-
sented in the interpretive exercise, is necessary if pupils are to demonstrate
thinking and problem-solving skills. The inclusion of such data in single objective
test items is possible but awkward. In addition, by having a series of related
test items based on a common set of data, both greater depth and breadth can
be obtained in the measurement of intellectual skills. Third, the interpretive
exercise minimizes the influence of irrelevant factual information on the meas-
urement of complex learning outcomes. As we noted with the single multiple-
choice item, a pupil may be unable to demonstrate his understanding of a
principle simply because he lacks some of the specific facts concerning the
situation in which application is to be made. This blocking of response, caused
by a lack of detailed factual information not directly pertinent to the purpose
of the measurement, can be largely eliminated with the interpretive exercise. In
the introductory material, we can provide pupils with the common background
of information needed to demonstrate understandings, thinking skills, and
problem-solving abilities.
The main advantage of the interpretive exercise over the essay test, in measur-
ing complex achievement, is derived from the greater structure provided. Pupils
are not free to redefine the problem or to demonstrate those thinking skills in
which they are most proficient. The series of objective items forces them to
demonstrate the specific mental processes called for. This, of course, also makes
it possible to measure separate aspects of problem-solving ability and to use
objective scoring procedures.
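Objective scoring of separate aspects can be pictured as a tally of keyed answers grouped by the skill each item is meant to measure. The following is a minimal sketch under stated assumptions, not a procedure given in the text: the skill labels, keyed answers, and the name `score_by_skill` are all hypothetical.

```python
from collections import Counter


def score_by_skill(items, responses):
    """Count correct responses separately for each skill.

    items: list of (skill_label, keyed_answer) pairs, one per test item.
    responses: the pupil's answers, in the same order as `items`.
    """
    if len(items) != len(responses):
        raise ValueError("one response is required for each item")
    scores = Counter()
    for (skill, keyed), response in zip(items, responses):
        # Adding 0 for a wrong answer still registers the skill in the tally.
        scores[skill] += int(response == keyed)
    return dict(scores)


if __name__ == "__main__":
    # Hypothetical four-item exercise covering two interpretive skills.
    items = [
        ("assumptions", "1"),
        ("assumptions", "2"),
        ("inferences", "T"),
        ("inferences", "N"),
    ]
    print(score_by_skill(items, ["1", "3", "T", "N"]))
```

Because every item has a single keyed answer, two scorers applying the same key arrive at the same subscores, which is the sense in which the scoring is objective.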
As with all forms of test items, the interpretive exercise has some specific
limitations.¹⁴ Probably the greatest limiting factor, and one that may have
occurred to you as you reviewed the sample items, is the difficulty of construc-
tion. Selecting printed materials which are new to the pupils but which are
relevant to the instructional outcomes requires considerable searching. When
pertinent material is found, it usually must be edited and reworked to make it
more suitable for testing purposes. Then, test items must be constructed which
call forth the behaviors indicated in the learning outcomes being meas-
ured. The procedure is most often a circular one, that of going back and forth
between revising the introductory material and revising the test items until a
satisfactory product is obtained. This entire procedure is time consuming and
requires greater skill than that needed to construct single objective test
items. Three positive comments can be made regarding the difficulty of con-
structing interpretive exercises. (1) There is an increasingly large number of
illustrative items of this type appearing in various subject-matter fields. The
references at the end of this chapter contain numerous examples which may
serve as guides to test construction. (2) The increased instructional emphasis on
complex learning outcomes resulting from the use of interpretive exercises off-
sets the additional effort required in test construction. (3) The task becomes
easier with practice and experience.


¹⁴ R. L. Ebel, “Writing the Test Item,” Educational Measurement, ed. E. F. Lindquist
(Washington, D.C.: American Council on Education, 1951).

Another limitation, especially pertinent where the introductory material is
in written form, is the heavy demand on reading skill. The poor reader is
handicapped by both the difficulty of the reading material and the length of time
required to read each test exercise. The first problem can be controlled some-
what by keeping the reading level low and the second by using brief passages.
Both of these provide only partial solutions, however, since the poor reader will still
be at a decided disadvantage. In the primary grades and in classes which con-
tain a predominance of poor readers, interpretive exercises might better be
limited to the use of pictorial materials.
In comparison to the essay test, the interpretive exercise has two shortcomings
as a measure of complex achievement. First, it cannot measure a pupil’s over-all
approach to problem solving. It is efficient for measuring specific aspects of
the problem-solving process but it does not indicate whether the pupil can
integrate and use these specific skills when faced with a particular problem.
Thus, it provides a diagnostic view of the pupils’ problem-solving abilities, in
contrast to the wholistic view that can be obtained with essay questions. Second,
since the interpretive exercise usually uses selection items, it is limited to learn-
ing outcomes at the recognition level. To measure the ability to define problems,
to formulate hypotheses, to organize data, and to draw conclusions, supply pro-
cedures such as the essay test must be used.


SUGGESTIONS FOR CONSTRUCTING INTERPRETIVE EXERCISES


There are two major tasks in constructing interpretive exercises: (1) the
selection of appropriate introductory material, and (2) the construction of a
series of dependent test items. In addition, special care must be taken to con-
struct test items which require an analysis of the introductory material in terms
of complex learning outcomes. The following suggestions will aid in constructing
interpretive exercises of high quality.
1. Select introductory material that is in harmony with the objectives of the
course. Interpretive exercises, like other testing procedures, should measure the
achievement of specific instructional goals. Success in this regard depends to a
large extent on the introductory material, since this provides the common
setting on which the specific test items are based. If the introductory material is
too simple, the exercise may become a measure of general information or simple
reading skill. On the other hand, if the material is too complex, or unrelated to
instructional goals, it may become a measure of general reasoning ability. Both
extremes must be avoided. Ideally, the introductory material should be pertinent
to the course content and complex enough to call forth the mental reactions
specified in the course objectives.
The amount of emphasis given to the various interpretive skills in the course
objectives is also a factor here. Care must be taken not to overload the test with
interpretive items in any particular area. The selection of introductory material
should be guided by the general emphasis to be given to the measurement of
complex achievement and by the relative emphasis to be given to each specific
type of interpretive skill.
2. Select introductory material that is appropriate to the curricular experi-
ences and reading level of the pupils. Many complex learning outcomes can be
measured with different types of introductory material. The ability to recognize
the validity of conclusions, for example, can be measured with written materials,
tables, charts, graphs, maps, or pictures. The type used should be familiar to
the pupils so that the nature of the material does not prevent them from demon-
strating their achievement of the complex learning outcomes. It would be grossly
unfair, for example, to ask pupils to recognize the validity of conclusions based
on data presented in graph form, if they had not had experience in interpreting
graphs similar to those used in the test.
Where various types of introductory material will serve a purpose equally
well and where they are all familiar to the pupils, we would tend to favor that
material which places the least demand on reading skill. For elementary pupils,
pictorial materials would be definitely favored. For higher grade levels, pictorial
materials and verbal materials with a low vocabulary load and simple sentences
would be given preference. Although general reading skill is a factor in all
written tests, it can become an especially prominent factor in interpretive exer-
cises unless special efforts are made to minimize its influence.
3. Select introductory material that is new to pupils. In order to measure
complex learning outcomes, the content of the introductory material must con-
tain some novelty. Asking pupils to interpret materials identical to those used
in instruction provides no assurance that the exercise is measuring anything
other than rote memory. Too much novelty, however, must be avoided. Materials
similar to those used in class but which vary slightly in content or form are
most desirable. Such materials can usually be obtained by modifying selections
from textbooks, newspapers, news-magazines, and various reference materials
pertinent to the course content.
4. Select introductory material that is brief but meaningful. Another method
of minimizing the influence of general reading skill on the measurement of
complex learning outcomes is to keep the introductory material as brief as
possible. Digests of articles are frequently available and provide good raw
material for interpretive exercises. Where digests are unavailable, the summary
of an article or a key passage may provide sufficient material. In some cases
the relevant information is summarized more adequately in a table, diagram,
or picture.
In striving for brief introductory material, be careful not to omit elements
which are essential to the interpretive skills being measured. The material should,
of course, also be complete enough to be meaningful and interesting to the
pupils.
ive š 
ihe es: Although some materials (for example, graphs) 
Techni te most selections require some adaptation for testing purposes. 
ical arti : ° Sage 
dic thes articles frequently contain long, detailed descriptions of events. On 
a hand, news reports and digests of articles are brief but frequently 
exaggerated reports of events to attract reader interest. While such 


€Xagver, 
Sgerated reports provide excellent material for measuring the ability to 
d for assumptions, the validity of 


ally be modified to be used 


rial for clarity, conciseness, and greater inter- 
can be used 


Be relevance of arguments. the nee 
effecti ions, and the like. the material must usu 
vely. 

aa the introductory materi 
fiiaim way arg procedures. Se 
revisions a i used, and the construction of t ) 
ple, assum ue material. In revising a description of an experi ° 
ieee. may. hypotheses. or conclusions which are explicitly stated in the 
š question alli be deleted and used as a basis for questions. By the same token, 
addition wa ing for application of the experimental findings may require the 
Material a new material to the selection. Thus the revision of the introductory 

nd the construction of test items proceed forw circular fashion 
se evolves.15 


n of the related test items 
| frequently suggests 
necessitates 


al and constructio 
Rewriting of materia 


est questions often 
ment, for exam- 


ard in a 


e intro- 
Í inter- 


until 
ac “qi 4 7 
clear, concise interpretive exerci 


quire analysis and interpretation of th 


n the construction 0 
of complex achievement. 
are answered directly in the introductory 
hich is explicitly stated in 
The second is to 


6. C 
‘ Co $ 3 
ucto nstruct test items which re 
r É a 
lien material. There are two common errors 1 
e " GE E 
exercises which invalidate them as a measure 


ne j ; 
mu lana questions which ar i 
the Aaa is, asking for factual informabor w Be 
include n. Such questions measure simple reading skill. Se ce 
ductor epestioris which can be answered correctly without reading e intro- 

y material—that is, requiring answers based on general information 1m 
of course, merely measure simple knowledge outcomes. 


intended, it should include only 


unction as 1 
ad the introductory material and to 


terpretations will 


the 
Tm These questions, 
e interpretive exercise is to f 


tho 
se a > f 
test items which require pupils to re 
nstances, the in 


mak : A 
tegu the desired interpretations. In some 1 

Ire pupils to supply knowledge beyond that presented in the exercise. In 

ed to the factual information provided. 


other: ; 
he K the interpretations will be limite 
relative emphasis on knowledge and interpretive skill will be determined 


l 
R; > š " " n 
L. Ebel. “Writing the Test Item,” Educational Measurement, ed. E. F. Lindquist 


(Washi 
shington, D.C.: American Council on Education. 1951). 


176 Constructing Classroom Tests 


by the specific learning outcomes being measured. Regardless of the empha 
however, the test items should be dependent on the introductory material, while 
at the same time calling forth mental reactions of a higher order than those 
related to simple reading comprehension. he 
7. Make the number of test items roughly proportional to the length of the
introductory material. It is inefficient to have pupils analyze a long, complex
selection of material and answer only one or two questions concerning it.
Although it is impossible to specify the exact number of questions which should
accompany a given amount of material, the items presented earlier in this
chapter illustrate a desirable balance. Other things being equal, we shall always
favor the interpretive exercise which has brief introductory material and a
relatively large number of test items.
8. In constructing test items for an interpretive exercise, observe all pertinent
suggestions for constructing objective items. The form of test item used in the
interpretive exercise will determine the suggestions for construction which are of
greatest value. If common forms of the multiple-choice or alternative-response
item are used, the specific suggestions for constructing these item types should
be observed. Where modified forms are used, suggestions for constructing each
of the various types of objective items should be reviewed for their applicability
in construction. Freedom from irrelevant clues and technical defects is as
important in interpretive exercises as it is in single, independent test items.
9. In constructing key-type test items, make the categories homogeneous and
mutually exclusive. The key-type item, which is used rather frequently in
interpretive exercises, is a modified multiple-choice form which uses a common set
of alternatives. In this regard, it is also similar to the matching item. Its
construction should be guided by the suggestions for constructing these item types,
with special attention devoted to the categories used in the key. All of the
categories should be homogeneous; that is, all should be
concerned with similar types of judgment. At the same time, there should be
no overlapping of categories. Each alternative should provide a distinctly
separate category so that a clear-cut system of classification is provided and so
that each item has only one correct answer.


EXAMPLE 


The majority of dental scientists are in general agreement that fluoridated water reduces
tooth decay. A number of cities have fluoridated their water supply and reports indicate
that fluoridated water is both safe and inexpensive. Despite an intensive campaign
pointing out the benefits of fluoridated water many cities do not have fluoridated water.
Resolved: In the interests of national health, all cities should be required to fluoridate
their water.
Directions: Read each of the following statements carefully. In front of each statement mark
KEY: A - if the statement supports the resolution.
     B - if the statement contradicts the resolution.
     C - if the statement is a fact.
_____ 1. The effects of fluoridated water on an individual’s health have not
         been studied.
(Similar items complete the exercise.)


In the above example note that the key includes two overlapping types of
judgment. One is concerned with the relationship of each statement to the
resolution and the other with the nature of the statement itself. This makes it
impossible to have only one correct answer for each statement. Item 1, for
example, would have to be marked category B because it contradicts the
resolution and category C because it is a statement of fact.
The above key could be improved by limiting the categories to the relevance
of the statements to the resolution, as illustrated in the following key.


KEY: A - if the statement supports the resolution.
     B - if the statement contradicts the resolution.
     C - if the statement neither supports nor contradicts the resolution.

If judging both the factual nature of a statement and its relevance is
significant, these two elements can be combined in such a way that discrete categories
can be obtained as follows:

KEY: A - if it is a statement of fact which supports the resolution.
     B - if it is a statement of opinion which supports the resolution.
     C - if it is a statement of fact which contradicts the resolution.
     D - if it is a statement of opinion which contradicts the resolution.

The major drawback to combining two types of judgment in one category is
the greater complexity of the key. This would be especially undesirable with
younger pupils.
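
The difference between an overlapping key and a key of discrete categories can also be checked mechanically. What follows is only a minimal sketch in Python, not a procedure taken from the text. It assumes that every statement has already been judged on two dimensions, its relevance to the resolution and its nature as fact or opinion. The names OVERLAPPING_KEY, build_combined_key, matching_categories, and is_mutually_exclusive are hypothetical and were chosen for illustration.

```python
from itertools import product
from string import ascii_uppercase

# The two judgments a pupil may be asked to make about a statement.
# These labels are illustrative assumptions.
RELEVANCE = ("supports", "contradicts")
NATURE = ("fact", "opinion")

# The first key of the example: category C looks only at the nature of the
# statement, so it overlaps categories A and B.
OVERLAPPING_KEY = {
    "A": lambda rel, nat: rel == "supports",
    "B": lambda rel, nat: rel == "contradicts",
    "C": lambda rel, nat: nat == "fact",
}


def build_combined_key():
    """Cross the two judgments so that each pair receives exactly one letter."""
    key = {}
    for letter, (rel, nat) in zip(ascii_uppercase, product(RELEVANCE, NATURE)):
        # Bind rel and nat as defaults so each test keeps its own pair.
        key[letter] = lambda r, n, rel=rel, nat=nat: (r, n) == (rel, nat)
    return key


def matching_categories(key, relevance, nature):
    """Return every key letter that could be defended for one statement."""
    return [letter for letter, test in key.items() if test(relevance, nature)]


def is_mutually_exclusive(key):
    """True when no possible statement fits more than one category."""
    return all(
        len(matching_categories(key, rel, nat)) <= 1
        for rel, nat in product(RELEVANCE, NATURE)
    )


if __name__ == "__main__":
    combined = build_combined_key()
    # Item 1 of the example contradicts the resolution and is a statement of fact.
    print(matching_categories(OVERLAPPING_KEY, "contradicts", "fact"))  # ['B', 'C']
    print(matching_categories(combined, "contradicts", "fact"))         # ['C']
    print(is_mutually_exclusive(OVERLAPPING_KEY), is_mutually_exclusive(combined))
```

A key that fails the check has at least one kind of statement with two defensible answers, which is the defect described for Item 1.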

10. In constructing key-type test items, develop standard key categories where
applicable. Despite the usefulness of the interpretive exercise for measuring
complex achievement, it has not been used extensively by classroom teachers.
A big factor in restricting its use has been the difficulty of construction. The
popularity of the key-type item in interpretive exercises can probably be
attributed to the fact that it uses a common set of alternatives. This makes it
easier to construct than the regular multiple-choice form which requires a
different set of alternatives for each item.

It is frequently possible to simplify further the construction of key-type
interpretive exercises by developing key categories that can be reused with different
content. For example, a learning outcome such as the ability to recognize
assumptions might lead to the following key:



KEY: A - an assumption which is necessary to make the conclusion valid.
     B - an assumption which would invalidate the conclusion.
     C - an assumption which has no bearing on the validity of the conclusion.

This key could be used with a brief description of a situation, a conclusion
based on the situation, and a list of assumptions. Both the key and the form
of the item could be used repeatedly with only the content varying. Although
selecting new content material is still a problem, the framework provided by
the standard key categories simplifies the process considerably.

This use of standard key categories is, of course, not applicable in all areas
and should not be permitted to determine which learning outcomes receive
emphasis. Rather, the time and effort saved by such procedures should free
the teacher to explore more creative applications of the interpretive exercise in
other areas.


SUMMARY 


Complex achievement refers to those learning outcomes based on the higher
mental processes. Such outcomes are classified under a number of general
headings including understanding, reasoning, thinking, and problem solving. The
attainment of goals in these areas can be measured by both objective and
subjective means. The most commonly used objective item is the interpretive
exercise.

The interpretive exercise consists of a series of objective questions based on
written materials, tables, charts, graphs, maps, or pictures. The questions require
pupils to demonstrate the specific interpretive skill being measured. For
example, pupils might be called upon to recognize assumptions, inferences,
conclusions, relationships, applications, and the like. The structure provided by the
interpretive exercise makes it possible to obtain independent measures of each
specific aspect of thinking and problem-solving skill. While it is efficient in
measuring such specific learning outcomes, it does not provide evidence
concerning the ability of pupils to integrate and use these skills in a global attack
on a problem. Thus it is limited to a diagnostic analysis of problem-solving
skills.

Probably the major factor in retarding the use of the interpretive exercise
has been the difficulty of construction. This process involves (1) the selection of
appropriate introductory material, (2) a revision of the material in harmony
with the outcomes to be measured, and (3) the construction of a series of
dependent test items that call forth the desired behavior. Although these steps are
admittedly time consuming, the rewards in terms of improved teaching-learning
practices would seem to justify the time and effort required.


SUGGESTIONS FOR FURTHER READING 


Ahmann, J. S., and M. D. Glock. Evaluating Pupil Growth. 2nd edition, Boston: Allyn and
Bacon, 1963. Chapter 4: “Measuring Understandings Objectively.”

Dressel, P. L., and L. B. Mayhew. Critical Thinking in Social Science. Dubuque, Iowa:
William C. Brown, 1954. Suggestions for evaluation and teaching at the college level.
Illustrative test items in appendix. Material also useful for secondary-school teachers.

Dressel, P. L., and L. B. Mayhew. Science Reasoning and Understanding. Dubuque, Iowa:
William C. Brown, 1954. Describes and illustrates how science articles can be used in
teaching and evaluation at the college level. Approach is also useful at the secondary
level.

Furst, E. J. Constructing Evaluation Instruments. New York: Longmans, Green, 1958. Pages
214-232, 261-274. Suggestions for constructing classification exercises for measuring
intellectual skills.

Travers, R. M. W. Educational Measurement. New York: Macmillan, 1955. Chapter 5: “The
Trend Toward the Measurement of Skills Considered Basic Outcomes of General
Education.”


Illustrative Test Items

National Council of Teachers of Mathematics. Evaluation in Mathematics. Twenty-sixth
Yearbook, Washington, D.C.: The National Council, 1961. Chapter 5: “Analysis of
Illustrative Test Items.”

Nelson, C. H. Let’s Build Quality Into Our Science Tests. Washington, D.C.: National
Science Teachers Association, 1958.

Morse, H. T., and G. H. McCune. Selected Items for the Testing of Study Skills and Critical
Thinking. National Council for the Social Studies, Washington, D.C.: National
Education Association, 1957.

Multiple-Choice Questions: A Close Look. Test Development Division, Princeton, N.J.:
Educational Testing Service, 1963. Illustrates the use of the multiple-choice item for
measuring complex achievement in a variety of fields. Maps, graphs, pictures, diagrams,
and written materials are used. Each item is followed by a statistical and logical analysis
of its effectiveness.


Chapter 10
measuring
complex
achievement:
the essay test


Some aspects of complex achievement cannot be measured objectively
. . . Learning outcomes which indicate pupils are to originate ideas . . . to
organize and express ideas . . . and to integrate ideas in a global attack on
a problem . . . require the greater freedom of response provided by the
essay test.


Up to this point our main concern has been with objective test items. We
have noted that they can measure a variety of learning outcomes and
that the interpretive exercise is especially useful for
measuring complex achievement. Despite this wide applicability of objective
item types, there remain significant instructional outcomes for which no
satisfactory objective measurements have been devised. These include such outcomes
as the ability to recall, organize, and integrate ideas; the ability to express
oneself in writing; and the ability to supply rather than merely recognize
interpretations and applications of data. Such outcomes require less structuring of
response than that imposed by objective test items. It is in the measurement
of these outcomes that the essay question serves its most useful purpose.


FORMS AND USES OF ESSAY QUESTIONS 


Our discussion of the essay question will be limited to its use in the
measurement of complex achievement. In doing so, the fact that many teachers use essay
questions to measure knowledge of factual information is not being disregarded.
Unfortunately, this is probably one of the major uses in classroom testing. It is
just that using essay tests to measure factual knowledge is seldom warranted.
The distinctive feature of essay questions is the freedom of response permitted
the pupil. He is free to select, relate, and present ideas in his own words. While
this freedom enhances the value of essay questions as a measure of complex
achievement, it introduces scoring difficulties which make them inefficient as
a measure of factual knowledge. For most purposes, knowledge of factual
information can be more efficiently measured by some type of objective item.

Essay questions should be used, primarily, for the measurement of those
learning outcomes that cannot be measured by objective test items. The unique
features of essay questions can be utilized most fully when their shortcomings
are offset by the need for such measurement. Learning outcomes concerned
with the abilities to select, organize, integrate, relate, and evaluate ideas require
the freedom of response and the originality provided for by essay questions.
In addition, these outcomes are of such great educational significance that the
expenditure of energy in the difficult and time-consuming task of evaluating
the answers can be easily justified.
The freedom of response provided by essay questions is not an all-or-none
affair but rather a matter of degree. At one extreme, the response is almost as
restricted as that in the short-answer objective item. Here a sentence or two
may be all that is required. At the other extreme, the pupil is given almost
complete freedom in making his response and his answer may require several pages.
Although variations in freedom of response tend to fall along a continuum
between these extremes, essay questions can be conveniently classified into two
types. These are the restricted response type and the extended response type
referred to earlier.1

Restricted Response Questions 

The restricted response question tends to limit both the content and the form
of pupil response. The content is usually limited by restricting the scope of the
topic to be discussed. Limitations on the form of response are generally
indicated in the statement of the question.


EXAMPLES 


List the major differences between the Korean War and previous wars
in which the United States has participated.
Why is the barometer one of the most useful instruments for forecasting weather? Answer
in a brief paragraph.
Describe two situations which illustrate the application of the law of supply and demand.
Do not use examples discussed in class.


1 J. S. Ahmann and M. D. Glock, Evaluating Pupil Growth (Boston: Allyn and Bacon,
1963).


Another way of restricting responses in essay tests is to base the questions on
specific problems. For this purpose, introductory material like that used in
interpretive exercises can be presented. Such items differ from objective
interpretive exercises only by the fact that essay questions are used instead of
multiple-choice or alternative-response items.


EXAMPLE 


The majority of dental scientists are in general agreement that fluoridating a city’s water
supply is a safe and inexpensive method of preventing tooth decay. However, many cities
have not fluoridated their water because the residents voted against it. One of the
arguments against fluoridation is that fluoridating a city’s water supply violates the
individual’s freedom of choice.
(A) Indicate whether you agree or disagree with the italicized part of the last statement.
(B) List reasons which support your position.


Because of the greater structure imposed by the restricted response question,
it is most useful for measuring learning outcomes requiring the interpretation
and application of data in a specific area. In fact, any of the learning outcomes
measured by an objective interpretive exercise can also be measured by a
restricted response essay question. The major difference is that the interpretive
exercise measures at the recognition level while the restricted response question
requires that the pupil supply the answer. In some instances, the objective
interpretive exercise would be favored because of the ease and reliability of scoring.
In other situations, the restricted response question might be favored because
of its more direct measurement of the learning outcome (e.g., the ability to
formulate valid conclusions).

Although restricting pupils’ responses to essay questions makes it possible
to measure learning outcomes of a more specific and clearly defined nature,
these same restrictions make them less valuable as a measure of those learning
outcomes emphasizing integration, organization, and originality. Restricting
the scope of the topic to be discussed and indicating the nature of the response
desired limits the pupil’s opportunity to demonstrate these behaviors. For such
outcomes, greater freedom of response is needed.


Extended Response Questions 


The extended response question provides a wide range of latitude. The pupil
is generally free to select any factual information which he thinks is pertinent,
to organize the answer in accordance with his best judgment, and to integrate
and evaluate ideas as he deems appropriate. This freedom makes it possible for
the pupil to demonstrate his competence in these particular areas: the
ability to select, organize, integrate, and evaluate ideas. On the other hand,
this same freedom makes the extended response question inefficient for
measuring more specific learning outcomes and it introduces scoring difficulties which
severely restrict its use as a measuring instrument.


EXAMPLES


Compare the administrations of President Eisenhower and President Kennedy in terms of
significant developments in international relations. Cite specific illustrations where possible.
Evaluate the significance of the sea captain's pursuit of the white whale in
Moby-Dick.
Describe the influence of Mendel's laws of heredity on the development of biology as a
science.
Write a scientific evaluation of the Copernican theory of the solar system. Include scientific
observations which support your statements.


The need for a measure of a pupil’s global attack on a problem, like that
provided by the extended response question, can easily be defended. The specific
thinking and problem-solving skills measured by objective interpretive exercises
and restricted response essay questions seldom function in isolation. In a
natural situation they operate together in a manner which includes more than
a sum of the specific skills involved. These skills interact with each other and
with the knowledge and understanding called for by the problem. Thus, it is
not just the specific skills we are interested in measuring but also how they
function as an integrated whole.

There is general agreement among both teachers and test specialists that the
extended response question does call forth complex behaviors that cannot be
measured by more objective means. Where the disagreement arises is in the
extent to which the scoring can provide a satisfactory measure of these complex
behaviors. Test specialists point out that the scoring is so unreliable that such
questions should not be used for measurement purposes but should be used as
teaching devices only. With little regard for the opinions and evidence of test
specialists, the majority of teachers continue to use the extended response
question in the measurement of pupil achievement. Unfortunately, they all too
frequently do so without regard to the learning outcomes being measured or to
the complexities involved in the construction and scoring of such questions.

Neither the “head-in-the-sand” position of the test specialists nor the
“everything is coming up roses” attitude of the teachers seems to be contributing
much to the valid measurement of pupil achievement. It would seem more
sensible to identify the complex behaviors we want to measure, formulate
questions which elicit these behaviors, evaluate the results as reliably as we can, and
then use these admittedly inadequate data as the best evidence we have available.


Summary Comparison of Learning Outcomes Measured 

As noted earlier, the restricted response question can measure a variety of
complex learning outcomes similar to those measured by the objective
interpretive exercise. The main difference is that the interpretive exercise measures
at the recognition level and the restricted response question requires the pupil
to supply the answer. In comparison, the extended response question measures
more general learning outcomes such as abilities to select, organize, integrate,
and evaluate ideas. A summary comparison of the types of complex learning
outcomes measured by each of these essay types in comparison to the objective
interpretive exercise is presented in Table 10.1.


Table 10.1

TYPES OF COMPLEX LEARNING OUTCOMES MEASURED BY ESSAY
QUESTIONS AND OBJECTIVE INTERPRETIVE EXERCISES

Type of           Examples of Complex Learning Outcomes that Can be
Test Item         Measured

Objective         Ability to:
Interpretive        recognize cause-effect relationships
Exercises           recognize the application of principles
                    recognize the relevance of arguments
                    recognize tenable hypotheses
                    recognize valid conclusions
                    recognize unstated assumptions
                    recognize the limitations of data
                    recognize the adequacy of procedures
                    (and similar outcomes based on the pupils’ ability to
                    recognize the answer)

Restricted        Ability to:
Response            explain cause-effect relationships
Essay               describe applications of principles
Questions           present relevant arguments
                    formulate tenable hypotheses
                    formulate valid conclusions
                    state necessary assumptions
                    describe the limitations of data
                    explain methods and procedures
                    (and similar outcomes based on the pupils’ ability to
                    supply the answer)

Extended          Ability to:
Response            produce, organize, and express ideas
Essay               integrate learnings in different areas
Questions           create original forms (e.g., designing an experiment)
                    evaluate the worth of ideas


The learning outcomes in Table 10.1 are of course merely suggestive of the
types of learning outcomes that can be measured. With slight modifications in
wording an infinite variety of outcomes could be stated in each area. It should
also be remembered that the freedom of response provided by the essay
question is a matter of degree. Thus, functions of the restricted response question
and the extended response question will tend to overlap near the middle of the
range.


ADVANTAGES AND LIMITATIONS OF ESSAY QUESTIONS


The main advantage of the essay question has already received considerable
emphasis: it provides a direct measure of complex learning outcomes
that cannot be measured by other means. It should be pointed out, however,
that the use of essay questions does not guarantee the measurement of complex
achievement. To serve such purposes, the construction of essay questions must
receive the same careful planning that goes into the construction of objective test
items. The course objectives pertinent to complex achievement must be
carefully defined in terms of specific learning outcomes and the essay questions
must be phrased so as to call forth the desired behavior. Where a table of
specifications is used in planning for the test, it is, of course, simply a matter
of constructing the questions in accordance with the specifications indicated in
the table.

A second advantage, confined largely to the extended response question, is
the emphasis given to the integration and application of thinking and
problem-solving skills. Although objective items, such as the interpretive exercise, can
be designed to measure various aspects of complex achievement, the ability to
integrate and apply these skills in a general attack on a problem requires the
unique features of the essay question. In addition to the importance of
measuring these outcomes, such questions tend to have a desirable influence on pupils’
study habits. Research studies in this area are rather old and are based on
more general uses of the essay question than we are considering here, but they
are in general agreement that pupils tend to direct their attention toward the
integration and application of larger units of subject matter when essay
questions are included in classroom tests.2 Questions designed specifically to measure
complex learning outcomes should have an even greater influence on these
desirable pupil learnings.
Since the pupil must present his answer in his own handwriting, the essay test
is frequently looked upon as a device for improving writing skills. No one would
deny that the ability to express oneself in writing is an educational objective
of great significance. However, there is some question whether the tensions
and pressures of test taking provide a desirable climate for developing writing
skills. It would seem that written assignments which could be completed under
more favorable conditions would contribute more to the attainment of this
objective.

Another commonly cited advantage of the essay question is its ease of
construction. This factor, probably more than any other, has led to its widespread
use by classroom teachers. In a matter of minutes, most teachers can formulate
several essay questions. This is an attractive feature to the busy teacher who
is caught up in the pell-mell of daily activities. This apparent advantage can
be very misleading, however. Constructing essay questions which call forth the
specific behaviors emphasized in a particular set of learning outcomes requires
considerable time and effort. Where ease of construction is stressed, it usually
refers to the all-too-frequent practice of dashing off questions at the last minute
with little regard for the course objectives. In such cases there is some question
whether ease of construction can be considered an advantage. In addition to
the invalidity of the measurement, evaluating the answers to carelessly
developed questions tends to be a confusing and time-consuming task.

2 J. M. Stalnaker, “The Essay Type of Examination,” Educational Measurement, ed. E. F.
Lindquist (Washington, D.C.: American Council on Education, 1951).

The limitations of the essay test are so severe that it would probably be
discarded entirely as a measuring instrument were it not for the fact that it
measures significant learning outcomes which cannot be measured by other
means. The most serious limitation is the unreliability of the scoring. A number
of studies have shown that answers to essay questions are scored differently by
different teachers and that even the same teachers score the answers differently
at different times. Variations in scoring the same paper have been shown to
range all the way from near perfect scores to those representing dismal failure.3
Such results hardly foster confidence in the essay test. In all fairness, however,
it should be pointed out that in most studies reporting on the unreliability of
scoring essay questions the learning outcomes being measured were not clearly
identified. Evaluating essay questions without adequate attention to the
learning outcomes being measured is comparable to the “three blind men appraising
the elephant.” One teacher stresses factual content, one organization of ideas,
and another writing skill. With each teacher evaluating the degree to which
different learning outcomes are achieved, it is not surprising that they vary
so widely in their scoring. Even variations in scoring by the same teacher can
probably be accounted for to a large extent by inadequate attention to learning
outcomes. Where evaluation of answers is not guided by clearly defined
outcomes, it tends to be based on less stable intuitive types of judgment. Although
the subjective process of scoring essay questions will always include some
uncontrollable variations, the scoring reliability can be greatly increased by
clearly defining the outcomes measured, properly framing the questions,
carefully following scoring rules, and obtaining practice in scoring.4
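
The extent to which two teachers agree in scoring the same set of papers can be summarized numerically. What follows is only a minimal sketch in Python, not a procedure prescribed in this chapter. It assumes that two teachers have independently scored the same six essay answers on a 20-point scale. The scores are invented for illustration, and the names pearson_r and largest_disagreement are hypothetical. The correlation coefficient is used here as one common index of agreement between scorers.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length, at least 2 papers")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores from one teacher do not vary")
    return sxy / sqrt(sxx * syy)


def largest_disagreement(x, y):
    """Largest difference in points awarded to any single paper."""
    return max(abs(a - b) for a, b in zip(x, y))


if __name__ == "__main__":
    # Hypothetical scores given by two teachers to the same six papers.
    teacher_a = [18, 12, 15, 9, 20, 14]
    teacher_b = [11, 16, 15, 13, 12, 19]
    print("correlation between scorers:", round(pearson_r(teacher_a, teacher_b), 2))
    print("largest disagreement on one paper:", largest_disagreement(teacher_a, teacher_b))
```

A coefficient near 1.00 would indicate that the two teachers rank the papers in much the same order. A low or negative coefficient would suggest that they are evaluating different learning outcomes.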

A closely related limitation of essay questions is the amount of time required
for scoring the answers. If the scoring is done conscientiously and helpful
comments are written on the papers, even a small number of papers may require
several hours of scoring time. Where classes are large and a number of essay
questions are used in a test, conscientious scoring becomes practically
impossible. Ironically, most of the suggestions for improving the scoring of essay
questions require more time, not less as might be hoped for. The only practical
solution seems to be to reserve the use of essay questions for those learning
outcomes that cannot be measured objectively. With fewer essay questions to
score in a given test, more time will be available for a careful reading and
evaluation of the answers.

Another shortcoming of essay questions which restricts their efficient use is
the limited sampling they provide. So few questions can be included in a given
test that some areas are measured intensively while many others are neglected.
This inadequate sampling makes essay questions especially inefficient for measuring
knowledge of factual information. For such outcomes we can use objective
test items, reserving essay questions for measuring complex achievement. This
does not eliminate the sampling problem, however, since we would also like
an adequate sample of complex behaviors. Here, where we must use essay questions
despite their limited sampling, special efforts should be made to obtain
as representative a sample as possible. One way of doing this is to accumulate
evidence from a series of essay questions administered at different times throughout
the school year.

*N. M. Downie, Fundamentals of Measurement (New York: Oxford University Press,
1958).

+J. M. Stalnaker, “The Essay Type of Examination,” Educational Measurement, ed. E. F.
Lindquist (Washington, D.C.: American Council on Education, 1951).


SUGGESTIONS FOR CONSTRUCTING ESSAY QUESTIONS 


The improvement of the essay question as a measure of complex learning
outcomes requires attention to two basic problems: (1) how to construct essay
questions which call forth the desired behavior and (2) how to score the
answers so that a reliable measure of the achievement is attained. Here we
shall confine ourselves to suggestions for constructing essay questions. In the
next section, we shall consider suggestions for improving scoring. While this
provides a convenient organization for discussion purposes, it should be noted
that these two procedures are closely interrelated.

1. Restrict the use of essay questions to those learning outcomes which cannot
be satisfactorily measured by objective items. Other things being equal, we shall
always favor objective measurement over subjective measurement. There appears
to be little justification for using essay questions to measure learning outcomes
which can be satisfactorily measured by more objective means. By the same
token, the problems of scoring and the inadequacy of sampling provide ample
justification for not using essay questions in those areas. It is where other
things are not equal that the use of essay questions is most desirable. Where
objective items are inadequate for measuring the learning outcomes, the use
of essay questions can be defended despite their limitations. Some of the complex
learning outcomes, such as those pertaining to the organization, integration, and
expression of ideas, will be neglected unless essay questions are used. It is by
restricting its use to these areas that its unique contribution to the evaluation
of pupil achievement can be most fully realized.

2. Formulate questions which will call forth the behavior specified in the
learning outcomes. As with objective items, essay questions should measure
the achievement of clearly defined instructional goals. If the ability to apply
principles is being measured, for example, the questions should be phrased in
such a manner that they call forth that particular behavior. Essay questions
should never be hurriedly constructed in the vain hope that they will measure
broad, important (but unidentified) educational goals. Each essay question
should be carefully designed to elicit particular aspects of behavior defined in
the desired learning outcomes.

Constructing essay questions in accordance with particular learning outcomes
is much easier with restricted response questions than with extended response
questions. The limits placed on the scope of the topic and the type of response
expected make it possible to relate a question directly to one or more of the
specific outcomes. In the case of the extended response question, the extreme
freedom makes it difficult to phrase the question so that the pupils' responses
will reflect the particular learning outcomes desired. This difficulty can be
partially overcome by indicating the bases on which the answer will be evaluated.


EXAMPLE 


Write a two-page statement defending the importance of conserving our natural resources.
(Your answer will be evaluated in terms of its organization, its comprehensiveness, and the
relevance of the arguments presented.)


Informing the pupils that they should pay special attention to organization, 
comprehensiveness, and the relevance of the arguments adds structure to the 
task and makes it possible to key the item to a particular set of learning out- 
comes. These directions alone will not, of course, provide assurance that the 
appropriate behaviors are exhibited. It is only when the pupils have been 
specifically taught how to organize ideas, how to treat a topic comprehensively,
and how to present relevant arguments that such directions serve their intended 
purpose. 

3. Phrase each question so that the pupil’s task is clearly indicated. The 
specific purpose a teacher had in mind when formulating a question is fre- 
quently not conveyed to the pupil because of the vague and ambiguous phrasing 
of the question. As a result pupils interpret the question differently and a hodge- 
podge of answers is received. Since it is impossible to determine which of the 
incorrect answers are due to misinterpretation and which to lack of achievement, 
the results are worse than worthless. They may actually be harmful if used as a
measure of pupil progress toward instructional goals.

One way to clarify the nature of the question is to make it as specific as
possible. For the restricted response question, this means limiting and structuring
the question until the desired response is clearly defined.


EXAMPLE 


Poor: Why do birds migrate? 


Better: State three specific hypotheses which might explain why birds migrate south in the
fall of the year.


The improved version of the above item presents the pupils with a definite 
task to perform. Although some pupils may not be able to provide the answer, 
they will all certainly know what type of response is expected. Note also how 
easy it would be to relate such an item to a specific learning outcome such as 
“the ability to formulate tenable hypotheses.” 

Where an extended response question is desired, some limitation of the question
may be possible but care must be taken not to destroy its unique function.
If it becomes too limited and structured, it will be less effective as a measure
of the ability to select, organize, and integrate ideas. The best procedure for
clarifying the extended response question seems to be the one suggested earlier.
That is, provide the pupil with explicit directions concerning the type of
response desired.


EXAMPLE 


Poor: Compare the Democratic and Republican parties.

Better: Compare the current policies of the Democratic and Republican parties with regard
to the role of government in private business. Support your statements with specific
examples, where possible. (Your answer should be confined to two pages. It will be
evaluated in terms of the appropriateness of the facts and examples presented, and
the skill with which it is organized.)

The first version of the above item provides no common basis for responding
and consequently no frame of reference for evaluating the answer. Even if the
learning outcome being measured was the “ability to organize,” greater
structure would be needed. Where pupils interpret a question differently their
answers will be organized differently because organization is partly a function
of the content being organized. Also, some pupils would delimit the problem
before answering and thereby provide themselves with a much easier task of
organization than that of pupils who attempted to treat the broad aspects of
the problem.

The improved version provides pupils with a clearly defined task without
destroying their freedom to select and organize the answer. This is achieved
both by limiting the scope of the question and by including directions concerning
the type of answer desired.


4. Indicate an approximate time limit for each question. Most essay questions
place a premium on speed of writing because inadequate attention is paid
to time limits during the construction of the test. As each question is constructed,
the teacher should estimate the approximate time needed for a satisfactory
response. In judging the response time to be allotted to a question, it is well
to keep in mind the speed of writing of the slower pupils in class. Most errors
in judging the amount of time needed are in the direction of too little time. It
is better to use fewer questions and more generous time limits than to put some
pupils at a distinct disadvantage.

The time limits allotted to each question should be indicated to the pupils,
so they can pace their writing on each question and not be caught at the end
of the testing time with “just one more question to go.” Where the test contains
both objective and essay questions, the pupils should, of course, be told
approximately how much time to spend on each part of the test. This may be done
orally or included on the test form itself. In either case, care must be taken not to
create overconcern about the time factor. The adequacy of the time limits might
very well be emphasized in the introductory remarks to allay any anxiety that
might arise.

5. Avoid the use of optional questions. A fairly common practice in the use
of essay questions is to provide pupils with more questions than they are
expected to answer and to permit them to choose a given number. The teacher,
for example, may include six essay questions in a test and direct pupils to write
on any three of them. This practice is generally favored by pupils because they
can select those questions they know most about. Except for the desirable effect
on pupil morale, however, there is little to recommend the use of optional
questions.

Where pupils answer different questions, it is obvious that they are taking 
different tests and the common basis for evaluating their achievement is lost. 
Each pupil is demonstrating the achievement of different learning outcomes. 
As noted earlier, even the “ability to organize” cannot be measured adequately 
without a common set of responses because organization is partly a function 
of the content being organized. 

The use of optional questions might also influence the validity of the test 
results in still another way. When pupils anticipate the use of optional questions 
they can prepare answers on several topics in advance, commit them to memory,
and then select questions where the answers are most appropriate. During such 
advance preparation, it is also possible, of course, for them to obtain help from 
others in selecting and organizing their answers. Needless to say, this provides 
a distorted measure of the pupils’ achievement. It also tends to have an undesir- 
able influence on study habits, since intensive preparation in a relatively few 
areas is encouraged. 


SUGGESTIONS FOR SCORING ESSAY QUESTIONS 

Provisions for improving the reliability of scoring answers to essay ques- 
tions begin long before the test has been administered. The first step is when the 
learning outcomes are clearly defined in behavioral terms. This is followed by 
a careful phrasing of the questions in accordance with the learning outcomes
and the inclusion of explicit directions concerning the types of answers desired.
It is only when both the pupils and the teacher have a clear notion of the task
to be performed that reliable scoring can be expected. No degree of proficiency
in evaluating answers can compensate for poorly phrased questions.

When the necessary preliminary steps have been taken in constructing essay
questions, the following suggestions can be used effectively to increase the
reliability of the scoring.

1. Prepare an outline of the expected answer in advance. This should contain
the major points to be included, the characteristics of the answer (e.g.,
organization) to be evaluated, and the amount of credit to be allotted to each.
For a restricted response question calling for three specific hypotheses, for
example, a list of acceptable hypotheses would be prepared and a given number
of scoring points would be assigned to each. For an extended response question,
the major points would be outlined. In addition, the relative amount of credit
to be allowed for such characteristics as the accuracy of the factual information,
the pertinence of the illustrative examples, and the skill of the organization
would be indicated.

Preparing a scoring key provides a common basis for evaluating the pupils'
answers. This increases the likelihood that our standards for each question will
remain constant throughout the scoring. If prepared during the construction of
the questions, the scoring key also helps us phrase questions which clearly convey
to the pupils the types of answers expected.
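
As a rough illustration of such a scoring key for a restricted response question, the following Python sketch records the acceptable points and the credit allotted to each. The class name, the field names, the hypothesis labels, and the point values are all hypothetical and do not come from the text; deciding whether a pupil's answer actually contains a given point remains the teacher's judgment.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class ScoringKey:
    """Outline of the expected answer for one essay question (illustrative only)."""
    question: str
    credit: Dict[str, int]              # acceptable point -> scoring points allotted
    max_counted: Optional[int] = None   # e.g. 3 when three hypotheses are requested

    def score(self, points_found: Iterable[str]) -> int:
        """Total credit for the acceptable points the scorer found in an answer."""
        # Ignore anything not listed in the key and count each point only once.
        earned = sorted(
            (self.credit[p] for p in set(points_found) if p in self.credit),
            reverse=True,
        )
        if self.max_counted is not None:
            earned = earned[: self.max_counted]
        return sum(earned)


# Hypothetical key for the bird-migration question used earlier in the chapter.
key = ScoringKey(
    question="State three specific hypotheses which might explain why birds "
             "migrate south in the fall of the year.",
    credit={"food supply": 2, "temperature": 2, "day length": 2, "breeding": 2},
    max_counted=3,
)
total = key.score(["food supply", "day length", "an unlisted idea"])
```

Writing the key down in this form before the test is given forces the same decisions the text calls for: which points are acceptable and how much credit each is worth.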

2. Use the scoring method which is most appropriate. There are two common
methods of scoring essay questions. One is called the point method and the other
the rating method. With the point method, each answer is compared to the ideal
answer in the scoring key and a given number of points assigned in terms of the
adequacy of the answer. With the rating method, each paper is placed in one
of a number of piles as the answer is read. These piles represent degrees of quality
and determine the credit assigned to each answer. If eight points are allotted
to a question, for example, the pile representing the highest quality might be
assigned eight points, the next six, the next four, the next two, and the last none.
Usually between three and five categories are used with the rating method.
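
The eight-point example above can be sketched in Python as follows. The function name is hypothetical, and spacing the credit in equal steps from the full allotment down to zero is an assumption that generalizes the single example given; a teacher might reasonably choose unequal steps.

```python
from typing import List


def pile_credits(max_points: float, n_piles: int) -> List[float]:
    """Credit for each quality pile, best pile first, in equal steps down to zero."""
    if n_piles < 2:
        raise ValueError("the rating method needs at least two piles")
    step = max_points / (n_piles - 1)
    return [max_points - i * step for i in range(n_piles)]


# Eight points allotted to a question, papers sorted into five piles.
credits = pile_credits(8, 5)
```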

Restricted response questions can generally be satisfactorily scored by the
point method. The restricted scope and the limited number of characteristics
included in a single answer make it possible to define degrees of quality
sufficiently precise to assign point values. The extended response question,
however, usually requires the rating method. Only gross judgments can be made
concerning the relevance of ideas, the organization of the material, and similar
characteristics evaluated in answers to extended response questions. Classifying
such characteristics into five categories is probably as precise as we can expect to be.

Where the rating method is used, it is desirable to make separate ratings for
each characteristic evaluated. That is, the answers should be rated separately
for organization, comprehensiveness, relevance of ideas, and the like. This
provides for greater objectivity and increases the diagnostic value of the results.

3. Decide on provisions for handling factors which are irrelevant to the
learning outcomes being measured. There are a number of factors that influence
our evaluations of answers to essay questions which are not directly pertinent
to the purpose of the measurement. Prominent among these are legibility of
handwriting, spelling, sentence structure, punctuation, and neatness. We should
make an effort to keep such factors from influencing our judgment when
evaluating the content of the answers. In some instances, such factors may, of
course, be evaluated for their own sake. Where this is done, a separate score
for written expression or for each of the specific factors should be obtained.
Where possible, however, we should not let such factors contaminate the
extent to which our test scores reflect the achievement of other specific learning
outcomes.

Another decision concerns the presence of irrelevant and inaccurate factual
information in the answer. Should you ignore it and score only that which is
relevant and correct? If you do, some pupils will write everything that occurs
to them knowing that you will sort out and give credit for anything correct.
This discourages careful thinking and desirable evaluative abilities. On the
other hand, if you take off points for irrelevant and inaccurate material, the
question of how much to lower the score on a given paper is a troublesome
one. Probably the best procedure is to decide in advance approximately how
much the score on each question is to be lowered where the inclusion of irrelevant
material is excessive. The pupils should then be warned that such a penalty
will be imposed.

4. Evaluate all answers to one question before going on to the next question. 
One factor which contributes to unreliability in the scoring of essay questions 
is a shifting of standards from one paper to the next. A paper with average 
answers appears to be of much higher quality when it follows a failing paper 
than when it follows one with near perfect answers. One way to minimize this 
influence is to score all answers to the first question, then all answers to the 
second question, and so on, until all of the answers have been scored. A more 
uniform standard can be maintained with this procedure, since it is easier to 
keep in mind the basis for judging each answer and answers of various degrees 
of correctness can be more easily compared. Where the rating method is used 
and the papers are placed in several piles on the basis of each answer, shifting 
standards can also be checked by reading each answer a second time and 
reclassifying where necessary. 

Evaluating all answers to one question at a time helps counteract another 
type of error that creeps into the scoring of essay questions. Where we evaluate
all of the answers on a single paper at one time, the first few answers create a
general impression of the pupil’s achievement which colors our judgment concerning
the remaining answers. Thus, if the first answers are of high quality,
we tend to overrate the following answers, while if they are of low quality we 
tend to underrate them. This “halo effect” is less likely to form when the 
answers for a given pupil are not evaluated in continuous sequence. 
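
The question-by-question procedure can be sketched as a scoring schedule. The Python below is only an illustration: the function name and the paper labels are hypothetical, and reshuffling the papers before each question is an added assumption, not something the text prescribes, intended to keep any one paper from always following the same neighbor.

```python
import random
from typing import List, Optional, Sequence, Tuple


def scoring_order(
    paper_ids: Sequence[str], n_questions: int, seed: Optional[int] = None
) -> List[Tuple[int, str]]:
    """List (question, paper) pairs so every answer to one question is scored together."""
    rng = random.Random(seed)
    order: List[Tuple[int, str]] = []
    for question in range(1, n_questions + 1):
        papers = list(paper_ids)
        rng.shuffle(papers)  # assumption: vary the paper sequence for each question
        order.extend((question, paper) for paper in papers)
    return order


order = scoring_order(["A", "B", "C"], n_questions=2, seed=0)
```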

5. Evaluate the answers without looking at the pupils’ names. The general 
impressions we form about each pupil during our teaching are also a source
of bias in evaluating essay questions. It is not uncommon for a teacher to give
a high score to a poorly written answer with the rationalization that “the pupil
really knows that material even though he didn’t express it too clearly.” A
similar answer by a pupil regarded less favorably will receive a much lower
score with the honest conviction that the pupil got everything he deserved. This 
form of “halo effect” is one of the most serious deterrents to reliable scoring 
by classroom teachers and is especially difficult to counteract. 

Where possible, the identity of the pupils should be concealed until all answers 
are scored. The simplest procedure for achieving this is to have the pupils put 
their names on the back of the papers. In some cases, such as where our curiosity 
cannot be easily controlled, it is better to identify the papers by numbers rather 
than names. Where the identity of a pupil cannot be concealed because of
familiar handwriting, the best we can do is make a conscious effort to eliminate
any such bias from our judgment.

6. If especially important decisions are to be based on the results, obtain
two or more independent ratings. Sometimes essay questions are included in
tests used to select pupils for awards, scholarships, special training, and the
like. In such cases, two or more competent persons should score the papers
independently and their ratings compared. After any large discrepancies have
been satisfactorily arbitrated, the independent ratings may be averaged for
more reliable results.
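
A minimal Python sketch of this comparison step follows. The function name, the paper labels, and the `max_gap` threshold for what counts as a "large" discrepancy are assumptions introduced for illustration; the text does not specify how large a difference calls for arbitration.

```python
from typing import Dict, List, Tuple


def combine_ratings(
    first: Dict[str, float], second: Dict[str, float], max_gap: float
) -> Tuple[Dict[str, float], List[str]]:
    """Average two independent ratings; list papers whose ratings differ too much."""
    if first.keys() != second.keys():
        raise ValueError("both raters must score the same papers")
    averaged: Dict[str, float] = {}
    to_arbitrate: List[str] = []
    for paper in first:
        if abs(first[paper] - second[paper]) > max_gap:
            to_arbitrate.append(paper)  # settle the discrepancy before averaging
        else:
            averaged[paper] = (first[paper] + second[paper]) / 2
    return averaged, to_arbitrate


averaged, to_arbitrate = combine_ratings(
    {"paper 1": 8, "paper 2": 4, "paper 3": 6},
    {"paper 1": 6, "paper 2": 4, "paper 3": 0},
    max_gap=2,
)
```

Papers returned in `to_arbitrate` are deliberately left out of the averages until the raters have resolved the difference.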


SUMMARY 


The essay question is especially useful for measuring those aspects of complex
achievement which cannot be measured by more objective means. These
include (1) the ability to supply rather than merely recognize interpretations
and applications of data, and (2) the ability to select, organize, and integrate
ideas in a general attack on a problem. Outcomes of the first type are measured
by restricted response questions and outcomes of the second type by extended
response questions.

Although essay questions provide a direct measure of significant learning
outcomes, they have several limitations which severely restrict their use: (1)
the scoring tends to be unreliable, (2) the scoring is time consuming, and
(3) a limited sampling of achievement is obtained. Because of these shortcomings
it is suggested that essay questions be limited to testing those outcomes
that cannot be measured by objective items.

The construction and scoring of essay questions are interrelated processes
which require careful attention if a valid and reliable measure of achievement
is to be obtained. Questions should be so phrased that they measure the attainment
of definite learning outcomes and clearly convey to the pupils the type
of response expected. Indicating an approximate time limit for each question
and avoiding the use of optional questions also contribute to more valid
results. Scoring procedures can be improved by (1) using a scoring key, (2)
adapting the scoring method to the type of question used, (3) controlling the
influence of irrelevant factors, (4) evaluating all answers to each question at
one time, (5) evaluating the answers without looking at the pupils’ names, and
(6) obtaining two or more independent ratings where important decisions are
to be made.



SUGGESTIONS FOR FURTHER READING 

Ahmann, J. S., and M. D. Glock. Evaluating Pupil Growth. 2nd edition, Boston: Allyn and
Bacon, 1963. Chapter 5: “Informal Essay Achievement Tests.”

Ebel, R. L., and Dora E. Damrin. “Tests and Examinations,” Encyclopedia of Educational
Research. 3rd edition, New York: Macmillan, 1960. Pages 1503-1506. Summarizes
research on essay testing.

Green, J. A. Teacher-Made Tests. New York: Harper & Row, 1963. Chapter 5: “Construction
and Use of Essay Tests.”

Thorndike, R. L. and Elizabeth Hagen. Measurement and Evaluation in Psychology and
Education. New York: John Wiley & Sons, 1961. Chapter 3: “The Teacher’s Own Tests”
(pages 41-56).

Wood, Dorothy A. Test Construction. Columbus, Ohio: Charles E. Merrill Books, 1960.
Chapter 10: “The Essay Test.”


Chapter 11

preparing,
administering,
and appraising
classroom tests


Classroom tests are most effective when attention is given to the quality
of the individual test items . . . the orderly arrangement of the items . . .
the clarity of the directions . . . and the procedures for administering and
scoring. . . . Classroom tests can also be improved by applying simple
methods of item analysis . . . and by building a test-item file.


In the preceding chapters, the first two steps in constructing classroom tests
have been emphasized. These are: (1) planning the test, including the preparation
of a table of specifications and a recognition of the general principles and
procedures guiding test construction; and (2) constructing individual test items
in accordance with the learning outcomes indicated in the table of specifications.
These two steps have received considerable attention because they are
crucial to the validity of a test. The only way we can have any certainty that a
classroom test will serve its intended purpose is to identify the learning outcomes
we wish to measure and then to construct test items which call forth the specific
behavior described in the learning outcomes. Our work does not stop here,
however. We must also assemble the items into a test, prepare directions, administer
the test, score the test, and interpret and appraise the results.

Care in planning a test and in constructing individual test items should be
followed by similar care in preparing the test for use, in administering the test,
and in scoring and appraising the results. Lack of attention to such factors may
have an adverse effect on the validity of the results. In the final analysis, valid
achievement testing is the end product of a systematically controlled series of
steps beginning with the identification of objectives and ending with the scoring
and interpretation of results. Although validity is “built in” during the construction
of the test items, systematic procedures of assembly, administration,
and scoring are necessary to assure that the test items will function with maximum
effectiveness. Appraising test items after a preliminary tryout can also
improve the quality of the items by indicating how each item actually does
function in measuring pupil achievement.


PREPARING THE TEST FOR USE 


The preparation of test items for use in a test is greatly facilitated if the items
are properly recorded, if they are written at least several days before they are
to be used, and if extra items are constructed. This simplifies the task of reviewing,
selecting, and arranging the items in final test form. Writing test items
early makes it possible to put them aside for a time before reviewing them for
defects. Constructing extra items makes it possible to eliminate those items
found to be defective. It also provides some latitude in fitting the final draft of
the test to the table of specifications.


Recording Test Items 


As test items are being constructed it is desirable to write each item on a
separate 5 X 8 card. In addition to the test item, the card should contain information
concerning the specific learning outcome and subject-matter content
measured by the item. A space should also be reserved on the card for item
analysis information and comments concerning the effectiveness of the item.
An illustrative card containing this information is presented in Figure 11.2.

Placing each item on a separate card provides the flexibility needed in preparing
a test for use. As the items are being reviewed and edited, items can
be eliminated, added, or revised with very little difficulty. The same holds true
when arranging the items for the test. They can be arranged and rearranged
merely by sorting the cards. The flexibility of this recording system also makes
it possible to build a card file of effective test items for future use. The specific
procedure for building such an item file will be described later in this chapter.
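
For readers who keep their items electronically, the card described above might be represented roughly as follows. This is a sketch only: the class name, the field names, and the two sample items are hypothetical and are not taken from Figure 11.2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ItemCard:
    """One test item plus the information the text suggests recording with it."""
    item_text: str
    learning_outcome: str      # specific learning outcome measured
    content_area: str          # subject-matter content measured
    item_analysis: List[str] = field(default_factory=list)  # filled in after use
    comments: List[str] = field(default_factory=list)       # notes on effectiveness


# Hypothetical cards; the wording is illustrative only.
cards = [
    ItemCard("Explain why a barometer falls before a storm.",
             "understands principles", "weather"),
    ItemCard("Define the term 'humidity'.",
             "knows terms", "weather"),
]

# "Sorting the cards": rearrange the file by learning outcome.
by_outcome = sorted(cards, key=lambda card: card.learning_outcome)
```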


Reviewing and Editing Test Items 

No matter how carefully test items have been prepared, defects inadvertently
creep in during construction. As we concentrate on the clarity and conciseness
of a question, a verbal clue slips in unnoticed. As we attempt to increase the
difficulty of an item, we unwittingly introduce some ambiguity. As we rework
an item to make the incorrect choices more plausible, the behavior called forth
by the item is unintentionally modified. In short, we focus our attention so
closely on some aspects of item construction that we overlook others. This
results in an accumulation of unwanted errors. Such technical defects can most
easily be detected by (1) reviewing the items after they have been set aside
for a few days, and (2) asking a fellow teacher to review and criticize the items.


In reviewing test items, it is desirable to put yourself in the role of the pupil 
taking the test. From this vantage point, each item should be read carefully 
and a judgment made concerning the type of response called forth by the item. 
At the same time, the item should be surveyed carefully for technical defects. 
The following questions will aid in analyzing the quality of each item. 

1. Is the aspect of knowledge, understanding, or thinking skill called forth by 
the item in harmony with the specific learning outcome and subject-matter 
content being measured? Where a table of specifications has been used as a 
basis for constructing the test items, this is merely a matter of checking to see 
whether the item still relates to the same cell in the table. If the functioning 
content of an item has shifted during construction, the item should either be 
modified so that it serves its original purpose or reclassified in light of the new 
purpose served by the item. In any case, the response called forth by an item 
should be in agreement with the purpose for which the item is to be used. 

2. Is the point of the item clear? A careful review of test items frequently
reveals ambiguity, inappropriate vocabulary, and awkward sentence structure
which were overlooked during the construction of the items. It seems
that returning to test items after they have been set aside for a few days provides
a fresh outlook which makes such defects more apparent. The difficulty of the
vocabulary and the complexity of the sentence structure must, of course, be
judged in terms of the maturity level of the pupils. However, at all levels, special
efforts must be directed toward the removal of ambiguity. In its final form, each
item should be so clearly worded that all pupils understand the task. Whether
a pupil responds correctly should be determined solely by whether he possesses
the knowledge or understanding being measured.

3. Is the item free from irrelevant clues? As noted earlier, an irrelevant clue 
is any element which leads the nonachiever to the correct answer and thereby 
prevents the item from functioning as intended. These include (1) grammatical 
inconsistencies, (2) verbal associations, (3) specific determiners (i.e., words
such as always and never), and (4) some mechanical features, such as correct
statements tending to be longer than incorrect ones. Most of these clues can be
removed merely by making a deliberate attempt to detect them during the item
review. The suggestions for constructing each of the item types provide an
excellent source of specific points to consider in searching out such clues.

Where it is possible to get a fellow teacher to review the test items, he should
be asked to read each item, indicate his answer, and note any technical defects
in the item. If his answer does not agree with the key, ambiguity may be indicated.
Asking him to “think out loud” as he determines the answer will usually
reveal his interpretation of the question and the source of the ambiguity. This
is the area in which another reviewer can be most useful. He will be less helpful
in evaluating the type of response called forth by the item, since this requires
a knowledge of what the pupils have been taught. Only the teacher knows for
sure whether an item measures understanding or merely the retention of a
previously learned answer.




When the test items have been revised and those to be included in the test 
have been tentatively selected, a final check should be made in terms of the 
following questions: 


1. Do the test items measure a representative sample of the learning outcomes?

2. Do the test items measure a representative sample of the various phases
of course content?

3. Are the test items adapted in difficulty to the purpose of the test and to
the pupils for whom the test is intended?

4. Are the test items free from overlapping (so that the information in one
does not provide a clue to the answer in another)?

The first two questions can be answered by comparing the final selection of
items with the table of specifications. Answers to the last two are determined by
reviewing the test items as a whole. Affirmative answers to these questions mean
the items are ready to be arranged in final test form.


Arranging the Items in the Test 


There are various methods of grouping items in an achievement test and 
the method will vary somewhat with the use to be made of the results. For most 
classroom purposes, however, a satisfactory arrangement of items can be 
obtained by a systematic consideration of the following factors: (1) the types 
of items used, (2) the learning outcomes measured, (3) the difficulty of the 
items, and (4) the subject matter measured.

First and foremost, the items should be arranged in sections by item type. 
That is, all true-false items should be grouped together, then all matching items, 
then all multiple-choice items, and so on. This arrangement provides for the 
fewest sets of directions; it is easier for the pupils since they can retain the 
same mental set throughout each section; and it greatly facilitates scoring.
Where two or more item types are included in a test, there is also some advantage
in keeping the simpler item types together and placing the more complex item
types last in the test, as follows:


1. True-false or alternative-response items.
2. Matching items.
3. Short-answer items.
4. Multiple-choice items.
5. Interpretive exercises.
6. Essay questions.


Arranging the sections of the test in the above order provides a sequence
that roughly approximates the complexity of the learning outcomes measured,
going from the simple to the complex. It is then merely a matter of grouping
the items within each item type. For this purpose, items which measure similar
outcomes should be placed together and then arranged in order of ascending

198 Constructing Classroom Tests 


difficulty. For example, the items in the multiple-choice section might be 
arranged in the following order: (1) knowledge of terms, (2) knowledge of 
specific facts, (3) knowledge of principles, and (4) application of principles.
Keeping the items which measure similar outcomes together is especially helpful 

in determining the types of learning outcomes causing pupils the greatest dif- 
ficulty. 

If for any reason it is infeasible to group the items by the learning outcomes 
measured, it is still essential that the items be arranged in order of increasing 
difficulty. Beginning with the easiest items and proceeding gradually to the 
most difficult items has a motivating effect on pupils. Also, encountering difficult 
items early in the test frequently causes pupils to spend a disproportionate 
amount of time on such items. If the test is long, they may be forced to omit 
later questions which they could easily have answered.

With the items classified by item type, as has been suggested, an order of
increasing difficulty can be obtained by arranging the sections of the test and
by arranging the items within each section. The sequence of item types suggested
above, beginning with true-false items and ending with essay questions,
provides a general order of ascending difficulty for arranging the sections of
the test. Some shifts in the first four item types may be warranted by the
difficulty of the specific items used, but interpretive exercises and essay tests
should certainly be last. When the separate sections are arranged in sequence,
the items within each section should then be placed in order of increasing
difficulty.

In constructing classroom achievement tests, there is little to be gained by 
grouping test items in terms of subject-matter content. Where it appears 
desirable to do so, such as in separating historical periods, these divisions 
should be kept to a minimum. Organizing items by subject-matter content is
probably most useful in mastery and diagnostic tests. Here the emphasis is on
minimum essentials and the identification of specific learning errors. In addition,
the level of difficulty of the items is of little significance in these tests, so
subject-matter groupings can be more easily applied.

To summarize, the most effective method for organizing items in the typical
classroom test is to (1) form sections by item type, (2) group the items within
each section by the learning outcomes measured, and (3) arrange both the sections
and the items within sections in an ascending order of difficulty. Use
subject-matter groupings only where they are needed for some specific purpose,
such as diagnosis.
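
The three-step arrangement summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Item` record, the `OUTCOME_ORDER` list (borrowed from the multiple-choice example in the text), and the numeric `difficulty` scale are hypothetical conveniences, not part of the procedure itself.

```python
from dataclasses import dataclass

# Section order suggested in the text, simplest item types first.
ITEM_TYPE_ORDER = [
    "true-false",
    "matching",
    "short-answer",
    "multiple-choice",
    "interpretive",
    "essay",
]

# Hypothetical ordering of learning outcomes within a section, following the
# multiple-choice example in the text.
OUTCOME_ORDER = [
    "knowledge of terms",
    "knowledge of specific facts",
    "knowledge of principles",
    "application of principles",
]


@dataclass
class Item:
    label: str
    item_type: str
    outcome: str
    difficulty: float  # assumed scale: a larger value means a harder item


def arrange(items):
    """Sort by item type, then learning outcome, then ascending difficulty."""
    return sorted(
        items,
        key=lambda it: (
            ITEM_TYPE_ORDER.index(it.item_type),
            OUTCOME_ORDER.index(it.outcome),
            it.difficulty,
        ),
    )
```

A teacher's own judgment of difficulty would normally replace the numeric scale assumed here; the sketch shows only the order in which the three factors are applied.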


Preparing Directions for the Test 


Teachers frequently devote considerable time and attention to the construction
and assembly of test items and then dash off directions with very little
thought. In fact, many teachers include no written directions with their tests,
assuming either that the items are self-explanatory or that the pupils are
conditioned to answering the types of items used in the test. Oral directions are,
of course, also used by some teachers, but they all too frequently leave much




to be desired. Whether written, oral, or both, the directions are a vital part 
of the test and should include at least the following points:¹


1. Purpose of the test. 

2. Time allowed for answering. 

3. Basis for answering. 

4. Procedure for recording the answers. 
5. What to do about guessing. 


The amount of detail devoted to each of these points depends mainly on the 
age level of the pupils, comprehensiveness of the test, complexity of the test 
items, and the experience of the pupils with the testing procedure used. Use 
of new item types and separate answer sheets, for example, requires much more 
detailed directions than familiar items requiring pupils merely to circle or 
underline the answer. 

Purpose of the Test. The purpose of the test is usually indicated at the
time the test is announced or at the beginning of the semester when the
evaluation procedures are described as a part of the general orientation to the
course. Should there be any doubt whether the purpose of the test is clear to all
pupils, however, it might better be explained again at the time of testing. This is
usually done orally. The only time a statement of the purpose of the test needs
to be included in the written directions is when the test is to be administered
to several sections taught by different teachers. Here a written statement of
purpose assures greater uniformity.

Time Allowed for Answering. It is desirable to indicate to the pupils how
much time they will have for the total test and how to distribute their time on
each of the parts. Where essay questions are included, it is also desirable to
indicate approximately how much time should be allotted to each question. This
enables the pupils to use their time most effectively and prevents the less able
pupils from spending too much time on questions that are particularly difficult
for them.

Classroom tests of achievement should generally have liberal time allowances.
Except for special purposes, such as measuring proficiency in shorthand, typing,
and simple computational skills, speed is not an important factor. Our main
concern is the level of achievement each pupil has attained. Were it not for
practical considerations like the length of class periods and the pressure of
other school activities, there would be no need for any time limits with most
general achievement tests.

¹ E. J. Furst, Constructing Evaluation Instruments (New York: Longmans, Green,
1958). R. M. W. Travers, How to Make Achievement Tests (New York: Odyssey Press,
1950).

Judging the amount of time pupils will need to complete a given test is not a
simple matter. It depends on the types of items used, the age and ability of the
pupils, and the complexity of the learning outcomes measured. As a rough guide,
the average high school pupil should be able to answer two true-false items,




one multiple-choice item, or one short-answer item per minute of testing time.
Interpretive test items would take much more time; the exact amount depends
on the length and complexity of the introductory materials. Also, elementary
pupils generally require more time per item than high school pupils, and reading
skill is an important determiner of the amount of time needed by a specific
group. Experienced teachers familiar with the ability and work habits of a given
group of pupils are in the best position to judge time allotments. Where such
experience is lacking, it is better to err in the direction of allotting too much
time than to prevent some of the slower pupils from demonstrating their
maximum levels of achievement.
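
The rough guide above lends itself to a simple calculation. The sketch below is a minimal illustration, assuming the per-minute rates stated for the average high school pupil; the `allowance` multiplier is a hypothetical addition for erring on the side of too much time and is not a figure from the text. Interpretive and essay items are left out because no rate is given for them.

```python
import math

# Items answerable per minute by the average high school pupil, per the rough
# guide in the text. Other item types are deliberately omitted because the
# text gives no rate for them.
ITEMS_PER_MINUTE = {
    "true-false": 2,
    "multiple-choice": 1,
    "short-answer": 1,
}


def estimated_minutes(item_counts, allowance=1.0):
    """Estimate testing time in whole minutes.

    item_counts maps an item type to the number of items of that type.
    allowance is a hypothetical multiplier (not from the text) for giving
    younger or slower groups extra time, e.g. 1.25 for 25 percent more.
    """
    minutes = sum(
        count / ITEMS_PER_MINUTE[item_type]
        for item_type, count in item_counts.items()
    )
    return math.ceil(minutes * allowance)
```

For a test of 20 true-false, 25 multiple-choice, and 5 short-answer items, the guide suggests about 40 minutes; a teacher unsure of the group would do better to round this upward than downward.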

Basis for Answering. The directions for each section of the test should indi- 
cate the basis for selecting or supplying the answers. With true-false, matching, 
and multiple-choice items this part of the directions can be relatively simple. 
For example, a statement like “select the choice which best completes the 
statement or answers the question” might be sufficient for multiple-choice items. 
Where interpretive exercises are used, however, more detailed directions are 
necessary since the basis for the response is much more complex. The directions 
must clearly indicate the precise type of interpretation expected. As illustrated
in Chapter 9, each interpretive exercise usually requires its own unique 
directions. 

It is sometimes desirable to include sample test items correctly marked so
that pupils can check their understanding of the basis for answering. This 
practice is especially desirable for elementary school pupils, and also for pupils 
at other levels where complex item types are used. 

As noted earlier, essay questions frequently require special directions con- 
cerning the type of response expected. If special emphasis is to be on the selec- 
tion and organization of ideas, for example, this should be indicated to the 
pupil so that he has a more adequate basis for responding. 

Procedure for Recording the Answers. Answers may be recorded on the
test form itself or on separate answer sheets. Where the test is short, or the
number of pupils taking the test is small, or the pupils are relatively young,
answers are generally recorded directly on the test paper. For most other
situations, separate answer sheets are preferred because they reduce the time needed
for scoring, and they make it possible to use the test papers over again. The
latter feature is especially desirable when the test is to be given to pupils in
different sections of the same course.

Directions for recording the answer on the test paper itself can be relatively 
simple. With selection items, it is merely a matter of instructing the pupils to 
circle, underline, or check the letter indicating the correct answer. For pupils 
in the primary grades, it is usually better to ask them to mark the answer 
directly by drawing a line under it. With supply items, the directions should
indicate where to put the answer, and the units in which it is to be expressed
if the answer is numerical.


Separate answer sheets are easily constructed and the directions for their
use can be placed on the test paper or on the answer sheet itself. A common
type of teacher-made sheet is presented in Figure 11.1. The directions on this
sheet are rather general, since they must cover instructions for recording various
types of answers. It should also be noted that pupils are instructed to cross out
rather than circle the letters indicating the correct answers. This is to facilitate
scoring with a stencil key. Circled letters cannot be readily seen through holes
in a stencil.


Course ________________          Name ________________
Section _______________          Date ________________
Test __________________
Score:  Part I ____   Part II ____   Part III ____   Total ____

DIRECTIONS: Read all directions on the test paper carefully and follow them
exactly. For each test item, indicate your answer on this sheet by crossing out
the appropriate letter (x) or filling the appropriate blank. Be sure that the
number on the answer sheet is the same as the number of the test item you
are answering.

True-False          Multiple-Choice          Short-Answer
Item  Answer        Item  Answer             Item  Answer
 1    T  F          21    A B C D E          41    ________
 2    T  F          22    A B C D E          42    ________
 3    T  F          23    A B C D E          43    ________
 4    T  F          24    A B C D E          44    ________

Figure 11.1. Top portion of a teacher-made answer sheet.


Special answer sheets developed for machine scoring can also be used with
classroom tests, but there is no advantage in using them unless machine scoring
facilities are readily available and the number of papers to be scored warrants
the expense. Where machine scoring is to be used, special directions should be
obtained from the company supplying the scoring service. An example of an
answer sheet for machine scoring is presented in Chapter 13 (Figure 13.1).

What to Do About Guessing. Where test items of the selection type are 
used, the directions should tell pupils what to do when they are uncertain of 
the answer. Should they guess or should they omit the item? If no instructions 
are given on this point, the bold pupils will guess freely while others will answer 
only those items of which they are fairly certain. The bold pupils will select 
some correct answers just by lucky guesses and thus their scores will be 
higher than they should be. On the other hand, if the pupils are merely
instructed “Do not guess” or “Answer only those items of which you are certain,”
the more timid pupils will omit many items they could answer correctly. Such 
pupils aren’t very certain about anything and this uncertainty prevents them 
from responding, even when they are reasonably sure of the answers. With 
these directions, the bold pupils will continue to guess, although possibly not 
quite so wildly. 

As noted by Cronbach,² the tendency to guess or not to guess when in doubt
about an item is determined by personality factors and cannot be entirely
eliminated by directions which caution against guessing or which promise penalties
to those who do guess. The only way to eliminate variations in the tendency to
guess is to instruct pupils to answer every item. When this is done, no pupil
is given a special advantage and it is unnecessary to correct for guessing in
the scoring. Directions such as the following are usually sufficient to
communicate this to the pupils: “Since your score is the number right, be sure to
answer every item.”

Some teachers object to such directions on the grounds that encouraging
guessing is undesirable from an educational standpoint. Most responses to
doubtful items are not wild guesses, however, but guesses guided by some
information and understanding. In this respect, they are not too different from the
“informed” guesses we make when we predict weather, judge the possible
consequences of a decision, or choose one course of action over another.
Problem solving always involves a certain amount of this type of “informed”
guessing.

A more defensible objection to directions which encourage guessing is that
the chance errors introduced into the test scores lower the accuracy of
measurement. Although this is certainly objectionable, it probably has less
influence on the validity of the results than the systematic advantage given to
the “bold” guessers by the “do not guess” directions.

For liberally timed classroom tests, the “answer-every-item” directions are
definitely favored. However, for speed tests, and in situations where teachers
prefer to discourage guessing, directions such as the following provide a good
compromise:

Answer all items for which you can find some reasonable basis for answering,
even though you are not completely sure of the answer. Do not guess wildly,
however, since there will be a correction for guessing.³

There is now a trend in standardized testing toward the “make informed
guesses, not wild guesses” type of directions. It should be noted, however,
that guessing is a greater factor in standardized testing than in ordinary
classroom testing, in that the test items are not as closely keyed to the learning
experiences of the pupils. Where pupils are familiar with the content of the test
and have sufficient time to consider every item, there is generally no need
to caution against “wild” guessing or to attempt a correction for it.

² L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).


Reproducing the Test 


In preparing the test materials for reproduction it is important that the test
items be spaced and arranged so that they can be read, answered, and scored
with the least amount of difficulty. Cramming too many test items on a page is
poor economy. What little paper is saved will not make up for the time and
confusion which results during the administration and scoring of the test.
All test items should have generous borders. Multiple-choice items should
have the alternatives listed in a vertical column beneath the stem of the item,
rather than across the page. Items should not be split with parts of the item on
different pages. With interpretive exercises, the introductory materials can
sometimes be placed on a facing page or separate sheet with all of the items
referring to it on a single page.

Unless a separate answer sheet is used, the space for answering should be
down one side of the page, preferably the left. The most convenient method
of response is circling the letter of the correct answer. With this arrangement,
scoring is simply a matter of placing a strip scoring key beside the column of
answers.

Test items should be numbered consecutively throughout the test. Each test
item needs to be identified during discussion of the test and for other
purposes such as item analysis. Where separate answer sheets are used, consecutive
numbering is, of course, indispensable.

The duplication of classroom tests is usually by mimeograph or Ditto machine.
Where a large number of copies are desired, the photo-offset method is also
used. Regardless of the reproduction process selected, it is desirable to
proofread the entire test before it is administered. Charts, graphs, and other
pictorial materials must be checked especially carefully to be certain that the
reproduction is accurate and the details clear.

³ The correction-for-guessing formula and the rationale for its use will be
discussed under Scoring later in this chapter.


ADMINISTERING AND SCORING THE TEST


The same care which has gone into the preparation of the test should be
carried over into the administration and scoring. Here we are concerned with (1)
providing optimum conditions for obtaining the pupils’ responses, and (2)
selecting convenient and accurate procedures for scoring the results.


Administering the Test 


The guiding principle in administering any classroom test is that all pupils 
must be given a fair chance to demonstrate their achievement of the learning 
outcomes being measured. This means a physical and psychological environ- 
ment conducive to their best efforts and the control of factors which might 
interfere with valid measurement. 

Desirable physical conditions such as adequate work space, quiet, proper 
light and ventilation, and comfortable temperature are sufficiently well known by 
teachers to warrant little attention here. Of greater importance, but frequently 
neglected, are the psychological conditions influencing test results. Pupils will 
not perform at their best if they are overly tense and anxious during testing.
Some of the things which create excessive test anxiety are: 


1. Threatening pupils with tests, if they do not behave. 

2. Warning pupils to do their best “because this test is important.” 
3. Telling pupils they must work fast to complete the test on time. 
4. Threatening dire consequences if they fail the test. 


The antidote to test anxiety is to convey to the pupils, by both word and 

deed, that the test results are to be used to help them improve their learning. 
They should also be reassured that the time limits are adequate to allow them 
to complete the test. This, of course, assumes that the test will be used to
improve learning and that adequate time limits have been provided.

The time of testing can also be a factor influencing the results. If tests are
administered just before “the big game” or “the big dance,” the results may
not be representative. Furthermore, in the case of individual pupils, fatigue, the
onset of illness, or worry about a particular problem may prevent maximum
performance. Arranging the time of testing in terms of such factors and
permitting the postponement of the test in individual cases, when appropriate, will
enhance the validity of the results.

The actual administration of the test is a relatively simple matter, since a
properly prepared classroom test is practically self-administering. Oral
directions, if used, should be presented in a clear, concise manner. Any sample
problems or illustrations put on the blackboard should be kept brief and simple.
Beyond this, suggestions for administering a classroom test consist mainly of
things to avoid.

1. Do not talk unnecessarily before the test. When a teacher announces that
there will be “a full forty minutes” to complete the test and then talks for the
first ten minutes, pupils feel that they are being unfairly deprived of testing
time. Besides, just before a test is no time to make assignments, admonish the
class, or introduce next week’s topic. Pupils are mentally set for the test and
will consciously ignore anything not pertaining to the test for fear it will hinder
the recall of information needed to answer the questions. Thus, the
well-intentioned remarks fall on “deaf ears” and merely increase anxiety toward
the test and cause hostility toward the teacher.

2. Keep interruptions during the test to a minimum. At times a pupil will
ask to have an ambiguous item clarified and it will be desirable to explain the
item to the entire group at the same time. Such interruptions are necessary
but should be kept to a minimum. All other distractions, both from without
and within the classroom, should, of course, also be eliminated where possible.
It is sometimes desirable to hang a “Do not disturb—TESTING” sign on the
outside of the door.

3. Avoid giving hints to pupils who ask about individual items. If the item is
ambiguous it should be clarified for the entire group, as indicated earlier. If it
is not ambiguous, the pupil should be told to answer it as best he can.
Refraining from giving hints to pupils who ask for help is especially difficult for
beginning teachers. However, providing unfair aid to some pupils (the bold,
the apple polishers, and so on) decreases the validity of the results and lowers
the morale of the class.

4. Discourage cheating, if necessary. Where good teacher-pupil rapport exists
and the pupils view tests as helpful rather than harmful, cheating is usually not
a problem. Under other conditions, however, it might be necessary to discourage
cheating by special seating arrangements and careful supervision. Receiving
unauthorized help from other pupils during a test has the same deleterious
effect on validity and class morale as receiving special hints from the teacher.
We are interested in each pupil doing his very best; but for valid results, his
score must be based on his own unaided efforts.


Scoring the Test


Procedures for scoring essay questions were described in the last chapter.
Here the discussion will be limited to the scoring of objective items.

If the pupils’ answers are recorded on the test paper itself, a scoring key is
usually obtained by marking the correct answers on a blank copy of the test.
The scoring procedure is then simply a matter of comparing the columns of
answers on this master copy with the columns of answers on each pupil’s paper.
A strip key, which consists merely of strips of paper on which the columns of
answers are recorded, may also be used if more convenient. These can be
prepared easily by cutting the columns of answers from the master copy of the
test and mounting them on strips of cardboard cut from manila folders.

Where separate answer sheets are used, a scoring stencil is most convenient.
This is a blank answer sheet with holes punched where the correct answers
should appear. The stencil is laid over each answer sheet and the number of
answer checks appearing through the holes is counted. When this type of
scoring procedure is used, each test paper should also be scanned to make
certain that only one answer was marked for each item. Any item containing
more than one answer should be eliminated from the scoring.

As each test paper is scored, it is desirable to mark each item that is answered
incorrectly. With multiple-choice items, a good practice is to draw a red line
through the correct answer of the missed items rather than through the pupil’s
wrong answers. This will indicate to the pupil those items he missed and at the
same time will let him know what the correct answers are. Time will be saved
and confusion avoided during discussion of the test. Marking the correct
answers of missed items is especially simple with a scoring stencil. Where no
answer check appears through a hole in the stencil, a red line is drawn across
the hole.
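
The stencil procedure just described can be expressed as a short routine. The sketch below is illustrative only: representing the key as a mapping from item number to letter, and each answer sheet as a mapping from item number to the set of letters marked, is an assumption made for the example.

```python
def score_with_key(key, answer_sheet):
    """Count correct answers, as a scoring stencil would.

    key maps item number -> correct letter.
    answer_sheet maps item number -> set of letters the pupil marked
    (a hypothetical representation; an empty or missing entry is an omission).
    Returns (number_right, missed_items), where missed_items lists the item
    numbers on which a red line would be drawn through the correct answer.
    """
    number_right = 0
    missed_items = []
    for item, correct in sorted(key.items()):
        marked = answer_sheet.get(item, set())
        # An item with more than one mark is eliminated from the scoring,
        # so it earns no credit even if the correct letter is among the marks.
        if len(marked) == 1 and correct in marked:
            number_right += 1
        else:
            missed_items.append(item)
    return number_right, missed_items
```

The list of missed items corresponds to the holes in the stencil through which no answer check appears, which is where the red line is drawn.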

In scoring objective tests, each correct answer is usually counted one point.
This is done because an arbitrary weighting of items makes little or no
difference in how the pupils rank on the test. If some items are counted two points,
some one point, and some one-half point, the scoring is more complicated
without any accompanying benefits. Scores based on such weightings will correlate
highly with scores based on the simpler procedure of counting each item one
point.

Where pupils are told to answer every item on the test, a pupil’s score is
simply the number of items he has answered correctly. There is no need to
consider wrong answers or to correct for guessing. When all pupils answer every
item on a test, the rank order of the pupils’ scores will be the same whether
the “number right” is used or a correction for guessing is applied. Some
teachers prefer to correct for guessing because they feel the resulting scores
provide a more accurate indication of the pupils’ actual achievement. As we shall
see below, however, this is questionable.
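
The statement about rank order can be checked with a little algebra. The following is a sketch rather than part of the original argument: it uses the correction formula presented later in this section, with R the number right, W the number wrong, and n the number of alternatives per item, and it introduces N (an assumed symbol) for the total number of items. When every item is answered, W = N - R, so

```latex
\[
S \;=\; R - \frac{W}{n-1}
  \;=\; R - \frac{N - R}{n-1}
  \;=\; \frac{nR - N}{n-1}
\]
```

Since n and N are the same for every pupil and n is at least 2, the corrected score S rises steadily with R. Two pupils therefore stand in the same order on S as on the number right.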

Correcting for Guessing. A correction for guessing is usually applied where
pupils do not have sufficient time to complete all items on the test and where
they have been instructed that there will be a penalty for guessing. The most
common formula used for this purpose is the following:


$$\text{Score} = \text{Right} - \frac{\text{Wrong}}{n - 1}$$


In this formula, n is the number of alternatives for an item. Thus, the formula
would apply to various selection-type items as follows:


True-false items: $S = R - \frac{W}{2 - 1}$, or $S = R - W$

Multiple-choice items
(A) Three alternatives: $S = R - \frac{W}{2}$
(B) Four alternatives: $S = R - \frac{W}{3}$
(C) Five alternatives: $S = R - \frac{W}{4}$


Use of a correction formula in the scoring makes it necessary to count both
right and wrong answers. Items which were omitted by a pupil are not counted
in the scoring.
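
The formula can be written as a small function, shown here as a minimal sketch. The function name and the use of exact fractions are choices made for the illustration, not part of the formula.

```python
from fractions import Fraction


def corrected_score(right, wrong, n_alternatives):
    """Score = Right - Wrong / (n - 1); omitted items are not counted."""
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    # Fraction keeps results such as 37/4 exact instead of rounding them.
    return right - Fraction(wrong, n_alternatives - 1)
```

With 60 items right and 15 wrong, the function gives 45 for a true-false test (n = 2) and 55 for a four-alternative multiple-choice test (n = 4).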


These correction-for-guessing formulas assume that when a pupil does not
know the answer to an item he guesses blindly among all alternatives and
selects the correct answer a given number of times on the basis of chance alone.
Thus, if a pupil has 60 items right and 15 items wrong on a true-false test, it
is assumed that he guessed blindly on 30 items on the test and had chance success
in guessing (15 right and 15 wrong). The formula merely removes his lucky
guesses from his score by subtracting the number wrong from the number right:
Corrected score = 60 - 15 = 45.
The same assumption is made in applying the formula to multiple-choice
items, but the possibility of selecting the correct answer is less because there are
more alternatives from which to choose. For example, where a pupil has 60
items right and 15 items wrong on a four-alternative multiple-choice test, it is
assumed that he guessed blindly on 20 items and guessed successfully
one-fourth of the time. Thus, his blind guessing resulted in 5 right answers and 15
wrong answers. To remove his lucky guesses from his score, it is simply a
matter of subtracting one third of his wrong answers. This is what the
correction formula does, as illustrated below:

$$S = R - \frac{W}{n - 1} \qquad S = 60 - \frac{15}{3} = 55$$

The correction-for-guessing (or correction for chance) formula provides a
suitable correction where the basic assumption can be satisfied, that is, that
pupils guess blindly when they do not know the answer. Such blind guessing
seldom occurs in classroom testing, however. Some correct guesses are informed
guesses based on partial information and some wrong answers are due to
misinformation or extremely plausible distracters. Where pupils can eliminate
some of the alternatives in items and make informed guesses among those
remaining, the formula undercorrects for chance success. Where pupils select
wrong alternatives because of misinformation or the plausibility of distracters,
the formula overcorrects for chance success. Consequently, when the correction
formula is used with classroom tests an unknown amount of error is introduced
into the scoring. Although it is hoped that the two types of error will cancel
each other out, there is no way of determining the amount of distortion in
the test scores.

Because of the questionable assumption on which the correction-for-guessing
formula is based, it is recommended that it not be used with the ordinary
classroom test. The only exception is where the test is speeded to the extent
that pupils complete different numbers of items. Here its use is defensible, since
pupils could increase their scores appreciably by rapidly (and blindly) guessing
at the remaining untried items just before the testing period ended.

⁴ F. B. Davis, “Item Selection Techniques,” Educational Measurement, ed. E. F.
Lindquist (Washington, D.C.: American Council on Education, 1951).


APPRAISING THE TEST

After the test has been scored, the usual practice is to discuss the results
with the pupils and then to discard the test. Except for the pupil criticism during
discussion, which helps identify some of the defective items, the teacher has
little evidence concerning the quality of the test he used to measure
achievement. In addition, by discarding the test he is wasting much of the careful
planning and hard work which went into the preparation of the test. A more desirable
procedure would be to appraise the effectiveness of the test items and to build a
file of high quality items for future use. In a few years the file of items would
be so extensive that items could be reused with a long enough time interval in
between to prevent pupils from being familiar with the specific content of the
items.

The effectiveness of each test item can be determined by analyzing the pupils'
responses to the item. This item analysis is usually designed so that answers to
the following questions can be obtained.


1. How difficult is the item?
2. How well does the item discriminate between high and low achievers? 
3. How effective is each of the distracters? 


Answers to such questions are of obvious value in selecting or revising items 
for future use. The benefits of item analysis are not limited to the improvement 
of individual test items, however. There are a number of fringe benefits of
special value to classroom teachers. The most important of these are the fol- 
lowing: 

1. Item analysis data provide a basis for efficient class discussion of the test 
results. Knowing the difficulty of each test item and how effectively it func- 
tioned in measuring achievement makes it possible to confine the discussion to 
those areas which will be most helpful to pupils. Easy items that were answered 
correctly by all pupils can be omitted from the discussion and the concepts 
in those items causing pupils the greatest difficulty can receive special emphasis. 
Similarly, misinformation and misunderstandings, reflected in the choice of
particular distracters, can be corrected. Of no less significance is the fact that
item analysis will expose technical defects in items. During discussion defective
items can be pointed out to pupils, saving much time and heated discussion
concerning the unfairness of these items. If an item is ambiguous and two
answers can be defended equally well, both answers might be counted correct
and the scoring adjusted accordingly.

2. Item analysis data provide a basis for remedial work. Although discussing
the test results in class can clarify and correct many specific points, item analysis
frequently brings to light general areas of weakness requiring more extended
attention. In an arithmetic test, for example, item analysis may reveal that the
pupils are fairly proficient in arithmetic skills but are having difficulty with
problems requiring the application of these skills. In other subjects, item analysis
may indicate a general weakness in knowledge of technical vocabulary, in an
understanding of principles, or in the ability to interpret data. Such information
makes it possible to focus remedial work directly on the particular areas of
weakness.

3. Item analysis data provide a basis for the general improvement of classroom
instruction. In addition to the above uses, which by themselves should
contribute to improved instruction, item analysis data can assist in evaluating
the appropriateness of the specific learning outcomes and the course content for
the particular type of pupils being taught. Material that is consistently too
simple or too difficult for the pupils might suggest curriculum revisions or
shifts in teaching emphasis. Similarly, errors in pupil thinking which persistently
appear in item analysis data might direct attention to the need for more effective
teaching procedures. In these and similar ways, item analysis data can provide
insights into instructional weaknesses and clues for their improvement.


4. Item analysis procedures lead to increased skill in test construction. Item
analysis reveals ambiguities, clues, ineffective distracters, and other technical
defects that were missed during the preparation of the test. This information
is used directly in the revision of the test items for future use. In addition to
the improvement of the specific items, however, we derive benefits from the
procedure itself. As we analyze pupils' responses to items, we become increasingly
cognizant of technical defects and the factors causing them. During revision
of the items, we obtain experience in rewording statements so that they are
less ambiguous, rewriting distracters so that they are more plausible, and
modifying items so that they are of a more appropriate level of difficulty. As a
consequence, our general test construction skills are appreciably increased by
the experience.


Item Analysis Procedure 

For most classroom tests, a simplified form of item analysis is all that is
necessary or warranted. A suitable procedure is to compare the responses of
pupils ranking in the upper and lower thirds of the class on the basis of the
total test score.⁵ The responses of pupils in the middle third are not included
in the analysis but are assumed to follow the same trend as those in the upper
and lower thirds.

To illustrate the method of item analysis, let us suppose that we have just
finished scoring 37 test papers for a sixth-grade science unit on “weather.”
Our item analysis might then proceed as follows:

1. Rank the papers in order from the highest to the lowest score.
2. Select the 12 papers with the highest scores (approximately one-third)
and the 12 papers with the lowest scores.
3. For each test item, tabulate the number of pupils in the upper and lower
groups who selected each alternative. This tabulation can be made directly on
the test paper or on the test item card as shown in Figure 11.2.
4. Estimate the difficulty of each item (percentage of pupils who got the item
right).
5. Estimate the discriminating power of each item (difference between the
number of pupils in the upper and lower groups who got the item right).
6. Evaluate the effectiveness of the distracters in each item (attractiveness of
the incorrect alternatives).

⁵ Upper and lower quarters or halves might also be used, if more convenient. For more
refined analysis, the upper and lower 27 per cent is frequently recommended and statistical
guides are most commonly based on this percentage. See Davis, footnote 4 above.
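The first three steps of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_groups` and `tabulate`, the pupil labels, and the scores and answers are all hypothetical, and rounding one-third of 37 papers to 12 simply mirrors the example in the text.

```python
from collections import Counter


def split_groups(scores, fraction=1 / 3):
    """Steps 1-2: rank papers by total score, return (upper, lower) pupil ids."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    n = round(len(ranked) * fraction)  # 37 papers -> 12 per group
    if n == 0:
        raise ValueError("not enough papers to form groups")
    return ranked[:n], ranked[-n:]


def tabulate(answers, group, item):
    """Step 3: count the pupils in one group who chose each alternative."""
    return Counter(answers[pupil].get(item, "omit") for pupil in group)


# Hypothetical class of 37 pupils with made-up scores and answers to item 1.
scores = {f"pupil{i:02d}": 50 - i for i in range(37)}
answers = {pupil: {1: "B" if score > 30 else "D"} for pupil, score in scores.items()}

upper, lower = split_groups(scores)
print(len(upper), len(lower))        # 12 12
print(tabulate(answers, upper, 1))   # Counter({'B': 12})
print(tabulate(answers, lower, 1))   # Counter({'D': 12})
```

The middle 13 papers are left out of the tabulation, as in the simplified procedure described above.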




Course: Science                         Dates used:
Content: Weather Instruments
Outcome: Knowledge

ITEM
Which of the following instruments is most useful in weather forecasting?
A  Anemometer
B  Barometer
C  Thermometer
D  Rain gauge

ITEM ANALYSIS DATA

Alternatives     A     B     C     D     Omits
Upper 12
Lower 12

Difficulty 71%          Discriminating Power .58

Comment:


Figure 11.2. Test item card with item-analysis data recorded.


The first three steps of this procedure merely provide a convenient tabulation
of pupils' responses from which we can readily obtain an estimate of item
difficulty, item discriminating power, and the effectiveness of each distracter. This
latter information can frequently be obtained simply by inspecting the item
analysis data. Note in Figure 11.2, for example, that 12 pupils in the upper
group and 5 pupils in the lower group selected the correct alternative (B).
This makes a total of 17 out of the 24 pupils who got the item right, indicating
that the item has a fairly low level of difficulty. Since more pupils in the upper
group than in the lower group got the item right, it is discriminating positively.
That is, it is distinguishing between high and low achievers (as determined by
the total test score). Finally, since all of the alternatives were selected by some
of the pupils in the lower group, the distracters (alternatives A, C, and D)
appear to be operating effectively.

Although item analysis by inspection will reveal the general effectiveness of
a test item, and is satisfactory for most classroom purposes, it is sometimes
desirable to obtain a more precise estimate of item difficulty and discriminating
power. This can be done by applying relatively simple formulas to the item
analysis data.

Estimating Item Difficulty. The difficulty of a test item is indicated by the 
percentage of pupils who get the item right. Hence, we can estimate item dif- 
ficulty by means of the following formula, in which R = the number of pupils 
who got the item right, and T = the total number of pupils who tried the item. 


$$\text{Difficulty} = \frac{R}{T} \times 100$$


Applying this formula to the item analysis data in Figure 11.2, our level of
item difficulty (P) would be 71 per cent, as follows:

$$P = \frac{17}{24} \times 100 = 71 \text{ per cent}$$

Note that in estimating item difficulty from item analysis data, our calculation
is based on the upper and lower thirds of the group only. We assume that
the responses of pupils in the middle third of the group would follow essentially
the same pattern. This estimate of difficulty is sufficiently accurate for classroom
use and is easily obtained since the needed figures can be taken directly from
the item analysis data.
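For readers who keep item records electronically, the difficulty estimate can be written as a one-line function. This is a minimal sketch; the name `item_difficulty` and the guard against an empty group are assumptions added for illustration, while the figures are those of Figure 11.2.

```python
def item_difficulty(right, tried):
    """Percentage of pupils who got the item right: (R / T) * 100."""
    if tried <= 0:
        raise ValueError("T must be a positive number of pupils")
    return right / tried * 100


# Figure 11.2: 17 of the 24 pupils in the upper and lower groups were right.
print(round(item_difficulty(17, 24)))  # 71
```

Note that a higher percentage means an easier item, which is why 71 per cent is described as a fairly low level of difficulty.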

Estimating Item Discriminating Power. As we have already noted, an
item discriminates in a positive direction if more pupils in the upper group
than the lower group get the item right. Positive discrimination indicates that
the item is discriminating in the same direction as the total test score. Since we
assume that the total test score reflects achievement of desired objectives, we
would like all of our test items to show positive discrimination.

The discriminating power of an achievement test item refers to the degree to
which it discriminates between pupils with high and low achievement. An
estimate of item discriminating power can be obtained by subtracting the number
of pupils in the lower group who got the item right ($R_L$) from the number of
pupils in the upper group who got the item right ($R_U$) and dividing by one
half of the total number of pupils included in the item analysis ($\tfrac{1}{2}T$).
Summarized in formula form, it would be⁶:

$$\text{Discriminating Power} = \frac{R_U - R_L}{\tfrac{1}{2}T}$$

⁶ Item discriminating power can also be expressed by means of a correlation coefficient
obtained directly from charts prepared for this purpose. See D. A. Wood, Test Construction
(Columbus, Ohio: Charles E. Merrill Books, 1960), page 85.




Applying this formula to the item analysis data in Figure 11.2, we would
obtain an index of discriminating power (D) of .58, as follows:

$$D = \frac{12 - 5}{12} = .58$$


This indicates approximately average discriminating power. An item with
maximum positive discriminating power would be one where all pupils in the
upper group got the item right and all pupils in the lower group got the item
wrong. This would result in an index of 1.00, as follows:

$$D = \frac{12 - 0}{12} = 1.00$$

An item with no discriminating power would be one where an equal number
of pupils in both the upper and lower groups got the item right. This would
result in an index of .00, as follows:

$$D = \frac{12 - 12}{12} = .00$$
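The three cases just worked out can be checked with a short function. This is a sketch only: the name `discriminating_power` and its argument names are hypothetical, and the function assumes, as the text does, that the upper and lower groups are the same size, so that each holds half of the pupils analyzed.

```python
def discriminating_power(right_upper, right_lower, total):
    """Index D = (R_U - R_L) / (T / 2) for equal-sized upper and lower groups."""
    if total <= 0:
        raise ValueError("T must be a positive number of pupils")
    return (right_upper - right_lower) / (total / 2)


print(round(discriminating_power(12, 5, 24), 2))   # 0.58  (Figure 11.2)
print(discriminating_power(12, 0, 24))             # 1.0   (maximum positive)
print(discriminating_power(12, 12, 24))            # 0.0   (no discrimination)
```

A negative result from the same function would signal an item that more pupils in the lower group than in the upper group answered correctly.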


With this formula it is also possible to calculate an index of negative
discriminating power. That is, one where more pupils in the lower group than
the upper group get the item right. This is generally wasted effort, however,
since we are not interested in using items which discriminate in the wrong
direction. Such items should be revised, so that they discriminate positively,
or discarded.

Evaluating the Effectiveness of Distracters. How well each distracter is
operating can be determined by inspection and there is no need to calculate
an index of effectiveness, although the formula for discriminating power can
be used for this purpose. In general, a good distracter is one that attracts more
pupils from the lower group than the upper group. Thus, it should discriminate
between the upper and lower groups in a manner opposite to that of the correct
alternative. An examination of the following item analysis data will illustrate
the ease with which the effectiveness of distracters can be determined by
inspection. Alternative A is the correct answer.


Alternatives    (A)    B              C    D              Omits

Upper 12         5     6              0    2              0
Lower 12         3     [illegible]    0    [illegible]    0


First note that the item discriminates in a positive direction since 5 pupils in the
upper group and 3 in the lower group got the item right. The index of
discriminating power is fairly low (D = .17), however, and this may be partly due
to the ineffectiveness of some of the distracters. Alternative B is a poor
distracter because it attracts more pupils from the upper group than the lower
group. This is most likely due to some ambiguity in the statement of the item.
Alternative C is completely ineffective as a distracter since it attracted no one.
Alternative D is functioning as intended for it attracts a larger proportion of
pupils from the lower group. Thus, the discriminating power of this item can
probably be improved by removing any ambiguity in the statement of the item
and by revising or replacing alternatives B and C. The specific changes must, of
course, be based on an inspection of the test item itself. Item analysis data merely
indicate poorly functioning items, not the cause of the poor functioning.
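The inspection just described can also be expressed as a small routine. The sketch below is an assumption-laden illustration, not the book's procedure: the function name `appraise_distracters`, the wording of the verdicts, and the counts are all hypothetical, and the counts are invented for the example rather than taken from the table above.

```python
def appraise_distracters(upper, lower, correct):
    """Compare upper- and lower-group counts for each incorrect alternative."""
    verdicts = {}
    for alt, in_upper in upper.items():
        if alt == correct:
            continue  # the correct answer is not a distracter
        in_lower = lower.get(alt, 0)
        if in_upper == 0 and in_lower == 0:
            verdicts[alt] = "ineffective: attracted no one"
        elif in_lower > in_upper:
            verdicts[alt] = "functioning: attracts more of the lower group"
        else:
            verdicts[alt] = "poor: does not attract more of the lower group"
    return verdicts


# Hypothetical counts for two groups of 12 pupils; A is the correct answer.
upper = {"A": 8, "B": 3, "C": 0, "D": 1}
lower = {"A": 4, "B": 1, "C": 0, "D": 7}
for alt, verdict in appraise_distracters(upper, lower, "A").items():
    print(alt, verdict)
```

As with inspection by eye, the routine only flags distracters that deserve a second look; the cause has to be found by reading the item itself.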


Cautions in Interpreting Item Analysis Data 


Item analysis provides a quick, simple technique for appraising the effectiveness
of test items. The information provided by such an analysis is
limited in many ways, however, and must be interpreted accordingly.⁷ Following
are some of the major cautions to observe.

⁷ See L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960),
and E. J. Furst, Constructing Evaluation Instruments (New York: Longmans, Green, 1958).

1. Item discriminating power does not necessarily indicate item validity.
In our description of item analysis, we used the total test score as a basis for
selecting the upper group (high achievers) and the lower group (low achievers).
This is the most common procedure, since comparable measures of achievement
are usually not available. Ideally, we would examine each test item in relation
to some independent measure of achievement. However, the best measure of the
particular achievement we are interested in evaluating is usually the total score
on the achievement test we have constructed. This is so because each classroom
test is uniquely related to specific instructional objectives and course content.
Even standardized tests in the same content area are usually inadequate as
independent criteria because they are aimed at more general objectives than
those measured by a specific classroom test in a particular course.

Using the total score from our classroom test as a basis for selecting high
and low achievers is perfectly legitimate as long as we keep in mind we are
using an internal criterion. In doing so, our item analysis provides evidence
concerning the internal consistency of the test rather than its validity. That is,
it indicates how effectively each test item is measuring whatever the
total test is measuring. Such item analysis data can be interpreted as evidence
of item validity only where the validity of the total test has been proven, or
can be legitimately assumed.

2. A low index of discriminating power does not necessarily indicate a
defective item. Items which discriminate poorly between high and low achievers
should be examined for the possible presence of ambiguity, clues, and other
technical defects. If none is found, and the items measure an important learning
outcome, they should be retained for future use. Any item that discriminates
in a positive direction can make a contribution to the measurement of pupil
achievement and low indices of discrimination are frequently obtained for
reasons other than technical defects.

Classroom achievement tests are usually designed to measure several different
types of learning outcomes (knowledge, understanding, application, and
so on). When this is the case, test items which represent an area receiving
relatively little emphasis will tend to have poor discriminating power. For
example, if a test has forty items measuring knowledge of specific facts and ten
items measuring understanding, the latter items can be expected to have low
indices of discrimination. This is because the items measuring understanding
have less representation in the total test score and there is typically a low
correlation between measures of knowledge and measures of understanding. Low
indices of discrimination, here, merely indicate that these items are measuring
something different from what the major part of the test is measuring. Removing
such items from the test would make it a more homogeneous measure of
knowledge outcomes but it would also damage the validity of the test since it would
no longer measure learning outcomes in the understanding area. Since most
classroom tests measure a variety of types of learning outcomes, low positive
indices of discrimination are the rule rather than the exception.

Another factor which influences discriminating power is the difficulty of the 
item. Those items at the 50 per cent level of difficulty make maximum discrimi- 
nating power possible, since it is only at this level of difficulty that all pupils in 
the upper half of the group can get the item right while all pupils in the lower 
half get it wrong.⁸ As we move away from the 50 per cent level of difficulty,
toward easier or more difficult items, the index of discriminating power becomes 
smaller. Thus, items which are very easy or very difficult have low indices of 
discriminating power. It is frequently necessary, or desirable, to retain such 
items, however, in order to measure a representative sample of learning out- 
comes and course content. 
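The link between difficulty and the highest attainable index can be made concrete with a short calculation. The sketch below is an illustration under assumptions: the function name `max_discriminating_power` is hypothetical, and it supposes two equal groups of 12 pupils (24 in all), as in the earlier example. The largest index occurs when as many of the correct answers as possible fall in the upper group.

```python
def max_discriminating_power(right, total):
    """Largest possible D when `right` of `total` pupils answer correctly.

    Assumes `total` is even and is split into two equal groups.
    """
    half = total // 2
    right_upper = min(right, half)       # fill the upper group first
    right_lower = right - right_upper    # the remainder falls in the lower group
    return (right_upper - right_lower) / half


for right in (0, 6, 12, 18, 24):
    difficulty = right / 24 * 100
    print(f"{difficulty:5.1f} per cent right -> maximum D = "
          f"{max_discriminating_power(right, 24):.2f}")
```

Under these assumptions the maximum index is 1.00 only at the 50 per cent level, falls to .50 at the 25 and 75 per cent levels, and reaches .00 when every pupil or no pupil gets the item right.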

To summarize, a low index of discriminating power should alert us to the
possible presence of technical defects in a test item but it should not cause us
to discard an otherwise worthwhile item. A well-constructed achievement test
will, of necessity, contain items with low discriminating power and to discard
them would result in a test which is less, rather than more, valid.

3. Item analysis data on classroom tests are highly tentative. Item analysis
procedures focus our attention so directly on the difficulty and discriminating
power of a test item that we are commonly misled into believing that these are
fixed, unchanging characteristics of the item. This, of course, is not true. Item
analysis data will vary from one group to another, depending upon the level
of ability of the pupils, their educational background, and the type of instruction
they have had. Add to this the small number of pupils we have available
for analyzing the items in our classroom tests, and the tentative nature of our
item analysis data becomes readily apparent. If just a few pupils change their
responses, our indices of difficulty and discriminating power can be increased
or decreased by a considerable amount.
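A small calculation shows how much a handful of changed responses can move the indices when each group has only 12 pupils. This is a sketch with hypothetical numbers: the function name `indices` is an assumption, the starting counts are those of Figure 11.2, and the changed counts are invented for illustration.

```python
def indices(right_upper, right_lower, group_size):
    """Return (difficulty in per cent, discriminating power) for equal groups."""
    total = 2 * group_size
    difficulty = (right_upper + right_lower) / total * 100
    power = (right_upper - right_lower) / group_size
    return difficulty, power


# Figure 11.2: 12 right in the upper group, 5 right in the lower group.
before = indices(12, 5, 12)
# Hypothetical change: two upper pupils now miss the item, two lower pupils get it.
after = indices(10, 7, 12)

print(round(before[0]), round(before[1], 2))  # 71 0.58
print(round(after[0]), round(after[1], 2))    # 71 0.25
```

In this made-up case four changed answers leave the difficulty untouched but cut the index of discriminating power by more than half, which is why such indices are best read as rough approximations.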

The tentative nature of item analysis data should discourage us from making
fine distinctions between items on the basis of indices of difficulty and
discriminating power. If an item is discriminating in a positive direction, all of the
alternatives are functioning effectively, and it has no apparent defects, it can be
considered satisfactory from a technical standpoint. The important question
then becomes, not how high is the index of discriminating power, but rather,
does the item measure an important learning outcome? In final analysis, the
worth of an achievement test item must be based on logical rather than
statistical considerations.

⁸ It should be noted that the 50 per cent level of difficulty does not guarantee maximum
discriminating power but merely makes it possible. If half of the pupils in the upper group
and half of the pupils in the lower group got the item right, it would still be at the 50 per
cent level of difficulty but the index of discrimination would be zero.

When used with classroom tests, item analysis provides us with a general
appraisal of the functional effectiveness of the test items, a means for detecting
technical defects, and a method for identifying instructional weaknesses. For
these purposes, the tentative nature of item analysis data is relatively
unimportant. Where we record indices of item difficulty or discriminating power on item
cards for future use, we should interpret these indices as rough approximations
only. As such, they are still superior to our unaided estimates of item difficulty
and discriminating power.


Building a Test Item File 


A file of effective test items can be built and maintained easily if items are
recorded on cards like the one presented in Figure 11.2. By indicating on the
item card both the learning outcome and the subject-matter content measured
by the item, it is possible to file the cards under both headings. A satisfactory
procedure is to use areas of subject-matter content as major categories with the
learning outcomes forming the subcategories. For example, our illustrative item
in Figure 11.2 measures knowledge of weather instruments, so it would be
placed in the first category under weather instruments, as follows:


Weather Instruments: 
Knowledge 
Understanding 
Application 

This type of filing system makes it possible to select items in accordance with 
any table of specifications in the particular area covered by the file. 
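The filing scheme just described, content areas as major categories and learning outcomes as subcategories, maps naturally onto a nested lookup. The following is a minimal sketch rather than a prescribed system: the class names `ItemCard` and `ItemFile`, the field names, and the method names are all hypothetical, while the card contents come from Figure 11.2.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ItemCard:
    """One test item card, with the headings shown in Figure 11.2."""
    stem: str
    alternatives: dict
    correct: str
    content: str                      # major category, e.g. "Weather Instruments"
    outcome: str                      # subcategory, e.g. "Knowledge"
    difficulty: Optional[float] = None
    discriminating_power: Optional[float] = None
    dates_used: list = field(default_factory=list)


class ItemFile:
    """Cards filed by subject-matter content, then by learning outcome."""

    def __init__(self):
        self._cards = defaultdict(lambda: defaultdict(list))

    def deposit(self, card):
        self._cards[card.content][card.outcome].append(card)

    def select(self, content, outcome):
        return list(self._cards.get(content, {}).get(outcome, []))


card = ItemCard(
    stem="Which of the following instruments is most useful in weather forecasting?",
    alternatives={"A": "Anemometer", "B": "Barometer",
                  "C": "Thermometer", "D": "Rain gauge"},
    correct="B",
    content="Weather Instruments",
    outcome="Knowledge",
    difficulty=71,
    discriminating_power=0.58,
)
item_file = ItemFile()
item_file.deposit(card)
print(len(item_file.select("Weather Instruments", "Knowledge")))      # 1
print(len(item_file.select("Weather Instruments", "Understanding")))  # 0
```

Selecting by content and outcome in this way corresponds to drawing items for one cell of a table of specifications.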

Building a test item file is a little like building a bank account. The first
several years are concerned mainly with making deposits. Withdrawals must be
delayed until a sufficient reserve is accumulated. Thus, items are recorded on
cards as they are constructed; item analysis information is added after the items
have been used; and then the effective items are deposited in the file. At first it
seems to be additional work with very little return. However, in a few years it
is possible to start using some of the items from the file and supplementing these
with other newly constructed items. As the file grows, it becomes possible to
select the majority of the items from the file for any given test without repeating
the items too frequently. To prevent using a specific test item too often,
the date an item is used is usually recorded on the card.
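The dates recorded on a card can be used to screen out items that were used too recently. The sketch below is hypothetical throughout: the function name `eligible` and the three-year waiting period are assumptions chosen for illustration, since the text asks only for a long enough interval between uses.

```python
from datetime import date


def eligible(dates_used, today, min_years=3):
    """True if the item was never used or was last used at least min_years ago."""
    if not dates_used:
        return True
    last_used = max(dates_used)
    return (today - last_used).days >= min_years * 365


today = date(1965, 5, 1)
print(eligible([], today))                    # True
print(eligible([date(1962, 5, 1)], today))    # True
print(eligible([date(1964, 5, 1)], today))    # False
```

The waiting period would in practice depend on how quickly pupils pass through the course and on how large the file has grown.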
A test item file assumes increasing importance as we shift from test items
which measure knowledge of specific facts to those which measure understanding,
application, and thinking skills. Items in these latter areas are difficult and
time consuming to construct. With all of the other demands on our time, it is
nigh unto impossible to construct effective test items in these areas each time
we prepare a new test. We seem to have two alternatives: either we neglect the
measurement of learning outcomes in these areas (which has been the typical
practice) or we slowly build a file of effective items in these areas. The choice
seems obvious, if the quality of pupil learning is our major concern.

For a test file to be most effective, pupils should not be permitted to keep 
their test papers after the test has been scored and discussed in class. This dis- 
turbs some teachers who feel that the pupils should have their test papers for 
later study and review. There is no particular advantage in permitting pupils 
to keep their test papers, however, if there has been an adequate discussion of 
the test results in class and a general review in the particular areas of weakness 
revealed by the test. After all, our major aim is to help pupils improve their 
general knowledge and understanding in a given area. The test is merely a sample
of this achievement. Although a discussion of test results will contribute to
improved learning, extended study of the answers to specific test items may 
actually detract from our major aim. This certainly would be the case where 
pupils concentrated on learning the sample of material included in the test to 
the neglect of the larger area of achievement it represented. 


SUMMARY 


The same care that goes into the construction of individual test items should 
be carried over into the final stages of test development and use. Giving careful 
attention to the procedures for: (1) preparing the test for use, (2) adminis- 
tering and scoring the test, and (3) appraising the results, will provide increased 
assurance that valid results are obtained. 

The preliminary steps in preparing the test for use are simplified if the items 
are recorded on cards. This facilitates the task of editing the items and arranging 
them in the test. The editing process involves a check of each item to be certain 
it is free from ambiguity and irrelevant clues and that its functioning content 
is in harmony with the intended purpose. The final group of items selected for 
the test should also be checked against the table of specifications to make sure 
that a representative sample of the learning outcomes and course content is 
being measured. In arranging the items in the test, all items of one type should 
be placed together in a separate section. The items within each section should 
be organized by the learning outcome measured and then placed in order of 
ascending difficulty. The directions for the test should convey clearly to the 
pupil the purpose of the test, the time allowed for answering, the basis for 
answering, the procedure for recording the answers, and what to do about 
guessing. 

The procedures for administering the test should provide all pupils with a
fair chance to demonstrate their achievement. Both the physical and the
psychological atmosphere should be conducive to maximum performance.
Unnecessary interruptions and unfair aid from other pupils, or the teacher, should
be avoided.

The scoring of the test can be facilitated by a scoring key, or scoring stencil 
if separate answer sheets are used. Counting each right answer one point is 
usually satisfactory. A correction for guessing is unnecessary with the typical 
classroom test where pupils have sufficient time to consider all questions. Be- 
cause assumptions underlying the use of correction-for-guessing formulas are 
questionable, it is recommended that they be used only with speeded tests. 

After the test has been scored, it is desirable to appraise the effectiveness of 
each test item by means of item analysis. This involves the use of simple statis- 
tical procedures for determining the difficulty and discriminating power of an 
item and the effectiveness of each of its distracters. The difficulty of an item 
is indicated by the percentage of pupils getting the item right. An item’s dis- 
criminating power refers to the degree to which it discriminates between pupils 
with high and low achievement. Effective distracters are those which attract 
more low achievers than high achievers. The results of item analysis are valu- 
able in discussing the test with pupils, in planning remedial work, in improving 
our teaching and testing skills, and in selecting and revising items for future 
use. For these and other purposes, however, item analysis data must be inter- 
preted cautiously because of their limited and tentative nature. 

Building a test file of effective items involves recording the items on 5 x 8 
cards, adding item analysis information, and filing the cards by both subject 
matter content and learning outcomes measured. Such a test item file is espe- 
cially valuable in the areas of complex achievement where the construction of 
test items is difficult and time consuming. When a sufficient file of high quality 
items has been assembled, the burden of test preparation is considerably
lightened.


SUGGESTIONS FOR FURTHER READING


Ebel, R. L. “Procedures for the Analysis of Classroom Tests,” Educational and Psychological
Measurement, 14, 352-364, 1954.
Ebel, R. L., and Dora E. Damrin. “Tests and Examinations,” Encyclopedia of Educational
Research, 3rd edition, New York: Macmillan, 1960. Pages 1510-1513. Summarizes
research on administration, scoring, and item analysis.
Furst, E. J. Constructing Evaluation Instruments. New York: Longmans, Green, 1958.
Chapter 11: “Review, Assembly, and Reproduction.” Chapter 12: “Administration and
Scoring.” Chapter 13: “Analysis and Revision.”
Katz, M. “Improving Classroom Tests by Means of Item Analysis,” The Clearing House, 35.
Sax, G., and Marybell Reade. “Achievement as a Function of Test Difficulty Level,”
American Educational Research Journal, 1, 22-25, 1964.
Tinkelman, S. N. Improving the Classroom Test. New York: Bureau of Examinations and
Testing, University of the State of New York, 1957. Suggestions for test construction and
a useful checklist for reviewing and appraising classroom tests.
Wood, Dorothy A. Test Construction. Columbus, Ohio: Charles E. Merrill Books, 1960.
Chapter 9: “Item Analysis.”




Test Bulletins 


Diederich, P. Short-Cut Statistics for Teacher-Made Tests. Evaluation and Advisory Service 
Series, No. 5, Princeton, New Jersey: Educational Testing Service, 1960. Describes sim- 
ple item analysis procedure based on a show of hands in class. 

Doppelt, J. D. The Correction for Guessing. Test Service Bulletin, No. 46, New York: The 
Psychological Corporation, 1954. 

Multiple-Choice Questions: A Close Look. Test Development Division, Princeton, New Jer- 
sey: Educational Testing Service, 1963. Illustrative item analysis data for multiple-choice 
items. 


chapter 12
standardized tests
of achievement and
scholastic aptitude


Every teacher should be familiar with standardized tests of achievement 
and scholastic aptitude . . . but which ones? There are well over one thou- 
sand tests in this area and the mass of test titles is confusing. . . . Learning 
the basic principles of achievement and aptitude testing—and becoming 
familiar with typical examples of the various types of tests in each area— 
provide preliminary guide lines to the effective selection and use of
standardized tests.


The variety of standardized tests available for use in the school can be
classified into three general areas: (1) achievement, (2) aptitude (or intelligence),
and (3) personality. The last area includes measures of various aspects of 
adjustment, attitude and interest. Since most standardized instruments for 
measuring such personality characteristics are self-report techniques, rather 
than tests in the strict sense of the term, they will be considered in Chapter 16 
along with other methods of evaluating personal-social development. Here, we 
shall limit ourselves to standardized tests of achievement and scholastic aptitude. 

There was a time in educational measurement when it was possible to list 
and describe the standardized tests available for use in the schools. This time 
has long since passed. In a book such as this, it would now be infeasible to list, 
let alone describe, the numerous educational tests available. What we have 
decided to do in this chapter is to describe various types of achievement and 
scholastic aptitude tests, to indicate some of the principles pertinent to their 
selection and use, and to list typical examples of the most widely used tests in 
each area. Although the tests included in this chapter are of high general quality, 
there are, of course, other educational tests of equally high quality. It is hoped
that the tests referred to in this chapter will be viewed merely as examples of
the many good achievement and scholastic aptitude tests available. 


CHARACTERISTICS OF STANDARDIZED TESTS 


A standardized test is one which has a fixed set of test items, specific direc- 
tions for administration and scoring, and has been given to representative 
groups of individuals for the purpose of establishing norms. Standard content 
and procedure make it possible to give an identical test to individuals in differ- 
ent places and at different times. As a result, an individual’s score on a standard- 
ized test can be compared to the scores of others who have taken the same test. 
Norms are merely the typical or average scores made by representative groups 
of individuals at various age and grade levels. Norms make it possible to compare
an individual’s score with the scores of other individuals whose characteristics are
known. Thus, it is possible to administer a standardized test to a pupil and 
compare his performance with that of a typical twelve-year-old, an average 
sixth grader, or a select group of pupils attending private schools in New 
England. 
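
The comparison of a pupil's score with a norm group can be sketched in a few
lines of Python. This is only an illustration: the norm-group scores below are
invented, the function name is hypothetical, and published tests report norms
in their own tables (percentile ranks, grade equivalents, and the like) rather
than as raw score lists.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below `score`, counting half of any ties."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)           # norm scores strictly below
    ties = bisect_right(ordered, score) - below   # norm scores equal to `score`
    return 100.0 * (below + 0.5 * ties) / len(ordered)

# Hypothetical raw scores for a small norm group of sixth graders.
sixth_grade_norms = [12, 15, 18, 18, 20, 22, 23, 25, 27, 30]

pupil_score = 23
rank = percentile_rank(pupil_score, sixth_grade_norms)
print(f"Raw score {pupil_score} has a percentile rank of {rank:.0f} in this norm group")
```

The same raw score would receive a different percentile rank against a
different norm group, which is why the characteristics of the norm group must
be known before a score is interpreted.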


Characteristics of a carefully standardized test can be summarized as follows: 


1. The test items are of high technical quality. They have been developed by 
test specialists; tried out experimentally (pretested) ; and selected on the basis 
of difficulty, discriminating power, and relationship to a rigid set of speci- 
fications. 

2. Directions for administering and scoring are so precisely stated that the 
procedures are standard for different users of the test. 

3. Norms, based on representative groups of individuals, are provided as
aids in interpreting test scores. These norms are based on various age and grade
groups on a national, regional, or state level. Norms for special groups, such
as private schools, might also be supplied.

4. Equivalent or comparable forms of the test are usually provided as well
as information concerning the degree to which the forms are comparable.

5. A test manual and other accessory materials are provided to guide
administering and scoring the test and interpreting and using the results.


Despite the common characteristics of standardized tests, selecting a test in 
any given area is no simple task. This is so because (1) there are literally hun- 
dreds of tests in each area, (2) there is wide variation in the completeness and 
quality of test materials, and (3) each test serves a slightly different use.
Standardized tests, like pupils, have their individual differences as well as their
common characteristics. The remainder of this chapter will be devoted to some of
these major differences and the role these differences play in the selection of
standardized tests.


STANDARDIZED ACHIEVEMENT TESTS 


Although teachers depend to a large extent on informal classroom tests for
evaluating pupil achievement, standardized tests can also help determine the
extent to which instructional objectives are being achieved. In addition,
standardized achievement tests can make a number of unique contributions to the
instructional program of the school.


Standardized Versus Informal Classroom Tests 


Standardized tests and carefully constructed classroom tests are similar in
many ways. Both are based on a carefully planned table of specifications, both
have the same type of test items, and both provide clear directions to the pupils.
The main differences between the two types reside in (1) nature of the learning
outcomes and content measured, (2) quality of the test items, (3) reliability of
the tests, (4) procedures for administering and scoring, and (5) interpretation
of scores. Comparative advantages of standardized and informal classroom
tests of achievement are shown in Table 12.1.


Table 12.1

COMPARATIVE ADVANTAGES OF STANDARDIZED AND INFORMAL
CLASSROOM TESTS OF ACHIEVEMENT

                    Standardized                      Informal
                    Achievement Tests                 Achievement Tests

Learning Outcomes   Measures outcomes and content     Well adapted to outcomes and
and Content         common to majority of United      content of local curriculum.
Measured            States schools. Tests of basic    Flexibility affords continuous
                    skills and complex outcomes       adaptation of measurement to
                    adaptable to many local           new materials and changes in
                    situations; content-oriented      procedure. Adaptable to various
                    tests seldom reflect emphasis or  size work units. Tend to neglect
                    timeliness of local curriculum.   complex learning outcomes.

Quality of          General quality of items high.    Quality of items is unknown
Test Items          Written by specialists,           unless test item file is used.
                    pretested, and selected on basis  Quality typically lower than
                    of effectiveness.                 standardized, due to limited
                                                      time and skill of teacher.

Reliability         Reliability high; commonly        Reliability usually unknown; can
                    between .80 and .95, frequently   be high if carefully constructed.
                    is above .90.

Administration      Procedures standardized;          Uniform procedures possible but
and Scoring         specific instructions provided.   usually flexible.

Interpretation      Scores can be compared to norm    Score comparisons and
of Scores           groups. Test manual and other     interpretations limited to local
                    guides aid interpretation and     school situation.
                    use.


A review of the comparative advantages of the two types of tests indicates
that each is superior for certain purposes and inferior for others. The broader
coverage of the standardized test, its more rigidly controlled procedures of
administering and scoring, and availability of norms for evaluating scores
make it especially useful for the following instructional purposes.


1. Evaluating the general educational development of pupils in the basic 
skills and in those learning outcomes common to many courses of study. 

2. Evaluating pupil progress during the school year or over a period of years. 

3. Grouping pupils for instructional purposes. 

4. Diagnosing relative strengths and weaknesses of pupils. 


5. Comparing a pupil’s general level of achievement with his scholastic
aptitude or intelligence.


The inflexibility of the standardized test makes it of little value for those 
purposes for which the informal classroom test is so admirably suited. 


1. Evaluating the learning outcomes and content unique to a particular class 
or school. 


2. Evaluating day-to-day progress of pupils and their achievement on work- 
units of varying sizes. 

3. Evaluating knowledge of current developments in such rapidly changing
content areas as science and social studies.


The complementary functions of the two types of tests indicate that both are 
essential to a sound instructional program. Each provides a specific type of 
information regarding the pupils’ educational progress. In both cases, however, 
the value of the information depends on the extent to which the tests are related 
to the instructional objectives of the school. Standardized achievement tests, 
like informal classroom tests, can serve the many worthwhile instructional
purposes attributed to them only when they measure the particular learning
outcomes and content deemed important by those responsible for the instructional
program.


Achievement Test Batteries 


Standardized achievement tests are frequently used in the form of test
batteries. A battery consists of a series of individual tests all standardized on the
same representative group of pupils. This makes it possible to compare test
scores on the separate tests and thus determine the relative strengths and
weaknesses of pupils in the different areas covered by the test. With an elementary
school test battery, for example, it is possible to determine that a pupil is
strong in language skills but weak in arithmetic skills, good in reading but less
proficient in spelling, and the like. Such comparisons are not possible with
separate tests which have been standardized on different groups of pupils, since
the base for comparison is not uniform.
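
Why a common norm group matters can be seen in a small Python sketch. It
converts raw subtest scores to standard scores using means and standard
deviations from a single standardization group, so that the resulting profile
is on one uniform base. All figures are invented for illustration, and the
names `battery_norms` and `profile` are hypothetical; actual batteries supply
their own conversion tables.

```python
# Hypothetical norm-group means and standard deviations for each subtest,
# all obtained from the same standardization group.
battery_norms = {
    "reading": (40.0, 8.0),
    "language": (35.0, 5.0),
    "arithmetic": (30.0, 6.0),
    "spelling": (25.0, 4.0),
}

def profile(raw_scores, norms):
    """Convert raw subtest scores to standard (z) scores on a common norm group."""
    return {
        name: (raw_scores[name] - mean) / sd
        for name, (mean, sd) in norms.items()
    }

# Hypothetical raw scores for one pupil.
pupil_raw = {"reading": 48.0, "language": 40.0, "arithmetic": 24.0, "spelling": 23.0}

pupil_profile = profile(pupil_raw, battery_norms)
for name, z in sorted(pupil_profile.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<11}{z:+.2f}")
```

Because every subtest is referred to the same group, a standard score of +1.00
in language and -1.00 in arithmetic can be read as a relative strength and a
relative weakness. Had the means and standard deviations come from different
groups, the two figures would not be comparable.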

A major limitation of test batteries is that all parts of the battery are usually
not equally appropriate for measuring the objectives of a particular school.
When a test battery is constructed, it is based on objectives and content
considered important by the specialists building the test. Although the goals of a
particular school are apt to be in harmony with some sections of the battery,
it is fairly certain that they will not be in harmony with all sections. Variations
in subject-matter content from one curriculum to another and differences in
grade placement of instructional materials make it very unlikely that the various
sections of a test battery will be uniformly applicable to the instructional
program of any given school. This limitation is especially pronounced in
content-oriented test batteries. It is of less significance in batteries designed to measure
basic skills and general educational development.

Elementary School Test Batteries. The use of achievement test batteries 
has been especially prominent at the elementary school level. This is under- 
standable since there is considerable uniformity in the learning outcomes sought, 
especially in the basic skills. All elementary batteries include sections on read- 
ing, language, and arithmetic skills. In addition, some include measures of 
work-study skills and tests in such content areas as science and social studies. 
Those batteries which include tests of knowledge in content areas usually also 
make available a separate partial battery limited to the measurement of basic 
skills. Batteries confined to the basic skills are generally preferred since
content-oriented tests are seldom well suited to the specific objectives of the local
instructional program.

There are a number of elementary school batteries available for measuring
achievement. Typical of the more widely used test batteries are the following:


California Achievement Tests

Iowa Tests of Basic Skills

Metropolitan Achievement Tests

SRA Achievement Series

Sequential Tests of Educational Progress (STEP)

Stanford Achievement Test


These and other tests referred to in this chapter are briefly described in the
appendix. More detailed descriptions and critical reviews of the tests are
available in Buros’ Mental Measurements Yearbooks.¹

¹ O. K. Buros, The Fifth Mental Measurements Yearbook (Highland Park, New Jersey:
Gryphon Press, 1959). O. K. Buros, The Fourth Mental Measurements Yearbook (Highland
Park, New Jersey: Gryphon Press, 1953). O. K. Buros, The Third Mental Measurements
Yearbook (New Brunswick, New Jersey: Rutgers University Press, 1949).

High School Test Batteries. The achievement test battery has been less
widely used at the high school level. Diversity of course offerings, variations in
content among courses in the same subject-matter area, and the greater
flexibility pupils have in selecting courses for their programs make it extremely
difficult to design a battery for general high school use. Test publishers have
attempted to find a common core on which to base a battery of tests in one of
the following ways:


1. Continue to build the tests around the basic skills of reading, arithmetic,
and language. This is the procedure used in tests such as the California
Achievement Tests. This particular battery, for the high school and college level, is one
of a continuous series of batteries ranging from grade one through the college
sophomore level.

2. Build tests of knowledge of specific course content in the basic areas of 
study: mathematics, science, social studies, and English. This approach was 
used in tests such as the Essential High School Content Battery. 

3. Build tests which measure general educational development in intellectual
skills and abilities which are not dependent upon any particular series of
courses. Typical tests in this category are the Iowa Tests of Educational
Development and the Sequential Tests of Educational Progress (STEP). The Iowa test
is designed for high school use only, while the STEP high school battery is
one of a continuous series of batteries ranging from grade four through the
college sophomore level.


Tests of general educational development are of special interest because they 
measure complex learning outcomes that cut across subject-matter lines and are 
common to the major content areas of the school. They emphasize understand- 
ings, interpretive skills, and the ability to apply knowledge and skills to new 
situations. Since these learning outcomes are closely related to the ultimate 
objectives of education, such tests are especially likely to have a desirable influ- 
ence on the curriculum and teaching methods of the school. 


Achievement Tests in Specific Areas 


In addition to achievement test batteries, there are literally hundreds of
separate tests designed to measure achievement in specific areas. The majority of
these can be classified as subject-matter tests or reading tests of the general 
survey type. A limited number of tests have also been developed solely for use 
as diagnostic tools. 

The separate test has certain advantages over a test battery. First, it is easier 
to select a separate test that fits the instructional objectives of a particular area.
The difficulty of relating an entire battery of tests to instructional objectives 
was pointed out earlier. Second, a separate test is usually longer than the sub- 
tests of a battery. This provides a more adequate sample of behavior and more 
reliable part scores for diagnostic purposes. Third, the flexibility of the separate 
test makes it easier to adapt to classroom instruction. The teacher can admin- 
ister the separate test when it best fits his instructional needs, rather than follow- 
ing the rigid schedule of the school testing program. 

The major limitation of separate tests is that each test is usually standardized 
on a different group of pupils. Since norm groups are not comparable, relative 
achievement of pupils in different areas cannot be compared. For example, it is 
not possible to determine whether a pupil has achieved more in science than 
mathematics, or in social studies than English, if the tests were not standardized 
on the same representative group of pupils. This is an especially serious limita- 
tion if the results are to be used for guidance purposes. Knowing a pupil’s 
strengths and weaknesses in the basic areas of achievement is essential for 
proper educational and vocational planning. 

Advantages of both the achievement test battery and separate tests in specific
areas are capitalized upon in a comprehensive program. The achievement
battery, administered as part of the school-wide testing program, provides a
general survey of the pupils’ educational development, and separate tests are selected
for more specific instructional purposes. Thus, an elementary teacher might
follow up a test battery with a more diagnostic reading test, or a high school
teacher might follow it up with a subject-matter test more directly related to
the specific learning outcomes he is emphasizing. Where such a comprehensive
program is not possible, the achievement test battery is generally favored and
teachers must rely much more heavily on their own informal classroom tests
as aids to teaching.

Subject-Matter Tests. Specific tests in the various subject-matter areas are
so numerous and they cover such a wide array of learning outcomes and
subject-matter topics that it is infeasible to list typical examples. There are over
a hundred separate tests in each of the major content areas taught in the high
school: English, mathematics, science, and social studies. A bibliography of
tests available for use in a particular area may be found in Buros’ Tests in
Print,² and critical evaluations of the tests may be obtained from Buros’ Mental
Measurements Yearbooks.³

There are several cautions that should be considered when evaluating tests
in specific subject-matter areas:


1. Since most subject-matter tests are content oriented, the date of
construction is especially important. Developments in some content areas, such as science
and social studies, are taking place at such a rapid rate that content-oriented
tests are soon out of date.

2. In addition to timeliness of test content, special attention should be
directed toward its appropriateness for the particular course in which it is to be
used. Since a standardized test includes only that content which is common to
a variety of school systems, it is apt to lack comprehensiveness and at the same
time to include questions on material that has not been included in the local
curriculum.

3. Many subject-matter tests are limited to measurement of knowledge out- 
comes, although there are a number of notable exceptions. Standardized tests 
of specific knowledge are seldom as pertinent and useful as well-constructed 
teacher-made tests in the same area.

4. Where a subject-matter test measures a variety of learning outcomes 
beyond that of specific knowledge, it is important that learning outcomes meas- 
ured by the test be in harmony with those emphasized in instructional objectives. 


Due to its flexibility and timeliness, the informal teacher-made test is
frequently better suited to the measurement of instructional objectives in a
particular course than is the standardized subject-matter test. However, when
carefully selected in terms of the course content and learning outcomes, the
standardized test can serve as a check on the teacher's informal classroom tests.

² O. K. Buros, Tests in Print (Highland Park, New Jersey: Gryphon Press, 1961).
³ O. K. Buros, The Fifth Mental Measurements Yearbook (Highland Park, New Jersey:
Gryphon Press, 1959). O. K. Buros, The Fourth Mental Measurements Yearbook (Highland
Park, New Jersey: Gryphon Press, 1953). O. K. Buros, The Third Mental Measurements
Yearbook (New Brunswick, New Jersey: Rutgers University Press, 1949).

Reading Tests. One of the most widely used tests at all levels of instruction
is the reading test. It plays a prominent role in achievement test batteries and
in tests of general educational development. In addition, there are well over a
hundred separate tests of reading ability.

Many reading tests are of the survey type. These are designed to measure a
pupil’s general level of reading ability. Such tests commonly measure vocabulary,
rate of reading, and comprehension. Typical tests of this type, at the elementary
and high school levels, include the following:

Elementary School Level 


Diagnostic Reading Tests: Survey Section 
Gates Primary Reading Tests 

Gates Advanced Primary Reading Tests 
Gates Reading Survey 

Nelson Reading Test 


Separate tests from achievement batteries 


High School Level 
Davis Reading Test 
Diagnostic Reading Tests: Survey Section 
Kelley-Greene Reading Comprehension Test 
Nelson-Denny Reading Test, Revised Edition 


Reading Comprehension: Cooperative English Tests (1960 Revision) 
Separate tests from achievement batteries 


In selecting a survey reading test for instructional purposes, one should not
be misled into believing that they are all somewhat alike because they measure
vocabulary, reading rate, and comprehension. There is little agreement among
test publishers concerning the type of reading material to include, the specific
skills which constitute comprehension, and the depth with which they
should be measured. Consequently, some tests will emphasize simple recall of
factual information presented in a reading passage while others will call forth
complex interpretive skills. Selection of a reading test, like the selection of other
achievement tests, must be in terms of the desired learning outcomes to be
measured. Unless a reading test measures the specific aspects of reading ability
included in the instructional objectives, it can provide a misleading picture of
pupil progress.

In addition to survey reading tests, there are a number of diagnostic reading
tests. These will be noted in the following section.

⁴ For a brief description of the tests listed here, see the appendix. For other reading tests,
consult Buros’ Tests in Print (see footnote 2, above).

Diagnostic Tests. Most achievement tests have some diagnostic value in that
subscores or individual test items can be analyzed to diagnose pupils’ strengths
and weaknesses. Tests designed more specifically for diagnostic purposes,
however, differ from survey tests in two important ways: (1) they have a larger
number of part scores and a correspondingly larger number of test items in
each area; (2) the test items are based on a detailed analysis of the specific
skills involved in successful performance and a study of the most common errors
made by pupils.

The two areas in which diagnostic tests are most common are reading and 
arithmetic. Typical tests in the reading area include Diagnostic Reading Tests, 
Durrell Analysis of Reading Difficulty, New Edition, and the SRA Reading 
Record. One of the most comprehensive sets of diagnostic tests in arithmetic is 
entitled Diagnostic Tests and Self Helps in Arithmetic. The materials include 
four screening tests used to locate the area of deficiency and twenty-three diag- 
nostic tests. The diagnostic tests cover the basic facts and the fundamental 
operations with whole numbers, common fractions, decimal fractions, per cent, 
and measures. On the reverse side of each diagnostic test is a series of related
self-help exercises.⁵
Several reservations should be kept in mind when selecting and using
diagnostic tests.


1. Each diagnostic test reflects the author's viewpoint toward diagnosis. In
selecting a test, the diagnostic procedures must be evaluated in light of the
type of information desired.

2. Diagnostic tests indicate the typical errors a pupil makes but they do not
indicate the causes of the errors. Although some causes can be easily inferred
from the type of error made, causes of a particular deficiency are frequently
multiple and interrelated in a complex manner.

3. Related to the previous point is the fact that diagnostic tests provide only
partial information for diagnosing a pupil’s difficulty. In the reading area, for
example, intelligence, vision, hearing, physical condition, and emotional factors
must also be considered.

4. Results from diagnostic tests tend to have low reliability because of the
relatively few items measuring each type of error. Thus, the findings regarding
specific strengths and weaknesses for any particular pupil should be regarded
as clues to be verified by other objective evidence and by regular classroom
observation.


In summary, a diagnostic test is a good starting point, but supplementary
information is needed before an effective remedial program can be initiated.

⁵ For a brief description of the tests listed here, see the appendix. For other diagnostic
tests, consult Buros’ Tests in Print (see footnote 2, above).


SCHOLASTIC APTITUDE TESTS


Since one of the major aims of the school is to assist each pupil to achieve
the maximum of which he is capable, it is not surprising that tests of mental
ability play a prominent role in the school testing program. An estimate of the
mental ability of pupils aids in individualizing instruction, organizing classroom
groups, identifying underachievers, placing pupils in special classes, and in
general planning for classroom instruction. Although the results of achievement
tests are also useful for these purposes, tests of mental ability make a unique
contribution in identifying the learning potential of pupils.

Tests designed to measure an individual’s potential for learning have long 
been called intelligence tests. This usage has been declining, however, since so 
many people have come to associate the concept intelligence with inherited 
capacity. In place of the term intelligence test have come such terms as mental 
ability test and scholastic aptitude test. When the tests are used for school pur- 
poses, the latter term is generally preferred. 

It is important to recognize that scholastic aptitude tests do not measure 
native capacity or learning potential directly. Like all other tests used in school, 
a scholastic aptitude test measures performance based on learned abilities. In 
this sense, it is a specific type of achievement test. Any conclusion concerning 
capacity or potential for learning must be inferred from the results and such
inferences can be validly made only when the following conditions (or
assumptions) have been met:


1. All pupils have had an equal opportunity to learn the types of tasks
presented in the test.

2. All pupils have been motivated to do their best on the test.

3. All pupils have the “enabling behaviors” (such as reading skill) necessary
for maximum performance on the test.

4. None of the pupils is hampered by test panic, emotional problems, or
other “disabling behaviors” which can prevent maximum performance on the
test.


These conditions are seldom fully met, of course, but the extent to which they 
are not met determines how much we err in estimating learning potential from 
scholastic aptitude test scores. Many of the misinterpretations and misuses of 
scholastic aptitude tests arise from failure to recognize the influence these
conditions have on test results and consequently on the inferences that can be
drawn from them.


Achievement and Scholastic Aptitude Tests 


Before proceeding with specific types of scholastic aptitude tests, it might be
well to consider some basic similarities and differences between achievement
tests and scholastic aptitude tests. A common distinction is that achievement
tests measure what a pupil has learned and scholastic aptitude tests measure
his ability to learn new tasks. While this appears to be a clear distinction, it
oversimplifies the problem and covers up some important similarities and
differences.⁷ Actually, both types measure what a pupil has learned and both are
useful for predicting his success in learning new tasks. The major differences
lie in (1) the type of learning measured by each test, and (2) the type of
prediction for which each is most useful.

⁶ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).

The types of learning measured by achievement and scholastic aptitude tests
vary along a continuum. At one extreme is the content-oriented achievement
test which measures knowledge of specific course content, and at the other
extreme is the culture-oriented scholastic aptitude test which measures behavior
learned from the general culture and which is relatively free of school
experiences. In between, and much closer together in terms of the types of learning
measured, are the achievement test of general educational development and
the school-oriented scholastic aptitude test. Both measure general abilities
learned in school. However, the test of educational development measures a
variety of intellectual skills and abilities pertaining to the major content areas
taught in the school, while the school-oriented scholastic aptitude test measures
intellectual skills in a more limited area, such as verbal and numerical reasoning.

These differences in the type of learning measured by the different tests are
summarized in Table 12.2. In reviewing this table, it must be remembered that
these are merely convenient categories for describing tests and that there is
considerable overlap in the type of learning measured by each of the general
test types.


Table 12.2

TYPE OF LEARNING MEASURED BY ACHIEVEMENT
AND SCHOLASTIC APTITUDE TESTS

Test Area     General Test Types                  Type of Learning Measured

Achievement   Content-Oriented Achievement Tests  Knowledge of subject matter in
              (e.g., Essential High School        particular courses such as English,
              Content Battery)                    mathematics, science, and social
                                                  studies.

              Tests of General Educational        Basic skills and complex learning
              Development                         outcomes common to many courses,
              (e.g., Iowa Tests of Educational    such as the ability to apply facts
              Development)                        and principles and to interpret
                                                  data.

Scholastic    School-Oriented Aptitude Tests      Verbal, numerical, and general
Aptitude      (e.g., Cooperative School and       problem-solving abilities similar to
              College Ability Tests)              those learned in school, such as
                                                  reading comprehension, vocabulary,
                                                  and arithmetic reasoning.

              Culture-Oriented Aptitude Tests     Verbal, numerical, and general
              (e.g., The Lorge-Thorndike          problem-solving abilities derived
              Intelligence Tests)                 more from the general culture than
                                                  from common school experiences.


⁷ A. G. Wesman, Aptitude, Intelligence, and Achievement, Test Service Bulletin No. 51
(New York: The Psychological Corporation, 1956).



The four general classes of tests presented in Table 12.2 can also be distin- 
guished in terms of the types of predictions for which each is most useful. Since 
past achievement is frequently the best predictor of future achievement, both 
types of achievement test are useful in predicting future learning. In general, 
the content-oriented achievement test can predict how well a pupil will learn 
new knowledge in the same content area but it is of little value in predicting 
future learning in other areas. For example, a test of first semester English will 
be a good predictor of second semester English but not of second semester 
mathematics, science, or social studies. In other words, its value as a predictor 
of future learning depends largely on the relationship between the content being 
measured and the content in the future learning situation. Tests measuring gen- 
eral educational development are much more effective predictors of future 
achievement than content-oriented tests because they measure intellectual skills 
and abilities common to a variety of content areas. In fact, tests of educational 
development have been shown to be as good predictors of general school
achievement as the best scholastic aptitude tests.⁸
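
The point that a content-oriented test predicts best within its own content
area can be illustrated with a correlation coefficient between earlier and later
scores. The Python sketch below uses entirely invented scores for six pupils,
and the helper `pearson_r` is a hypothetical name; it is meant only to show how
such a comparison might be computed, not to report any actual findings.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same six pupils (illustrative only).
first_sem_english = [52, 60, 65, 70, 78, 85]
second_sem_english = [55, 58, 68, 72, 75, 88]
second_sem_math = [80, 62, 71, 66, 74, 69]

r_english = pearson_r(first_sem_english, second_sem_english)
r_math = pearson_r(first_sem_english, second_sem_math)
print(f"First-semester English vs. second-semester English: r = {r_english:.2f}")
print(f"First-semester English vs. second-semester mathematics: r = {r_math:.2f}")
```

In these invented data the English test correlates far more closely with later
English than with later mathematics, which is the pattern the paragraph above
describes. With real pupils the size of the difference would depend on the
tests and the courses involved.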

If achievement tests are such good predictors of future learning, why do we
use scholastic aptitude tests in schools? There are several good reasons.
(1) A scholastic aptitude test can be administered in a relatively short time (as
short as 20 minutes), while a comprehensive battery of achievement tests would
take several hours. One test of general educational development takes
approximately eight hours. (2) In addition to time saved, scholastic aptitude tests can
be used with pupils of widely varying educational backgrounds. Since the type
of learning measured is that common to most pupils, an individual is less apt
to be penalized because of specific weaknesses in his past training. (3) Scholastic
aptitude tests can be used before a pupil has had any training in a particular
area. For example, success in a French course cannot be predicted by an
achievement test in French until the person has had some training in it. (4) There is
an additional reason which applies more specifically to the culture-oriented
scholastic aptitude test. Since these are measures of aptitude least influenced by
school-learned abilities, they can be used to distinguish low achievers working
up to capacity from those with potential for higher achievement. Identifying
such underachievers with scholastic aptitude tests which depend heavily on
school-learned abilities is possible but less effective, since the achievement skills
required to respond to the test are the very ones in which the underachiever is
most apt to be weak.


In summary, both achievement tests and scholastic aptitude tests measure
learned abilities, but achievement tests measure those which are more directly
dependent on specific school experiences while scholastic aptitude tests measure
those of a more general nature. This is a matter of degree, however, with
achievement and scholastic aptitude tests becoming very much alike in the area
of general educational development. Achievement and scholastic aptitude tests
are also similar in that they are both useful in predicting future achievement.
In general, scholastic aptitude tests provide a more convenient measure and
one that predicts over a wider range of future school experiences. As with types
of learning measured, these differences are also much less pronounced near the
middle of the range.

⁸ J. C. Merwin and E. F. Gardner, “Development and Application of Tests of Educational
Achievement,” Review of Educational Research, 32, 40-50, February, 1962.


Group Tests of Scholastic Aptitude 


The majority of scholastic aptitude tests administered in the schools are group
tests. These are tests which, like the standardized achievement tests, can be
administered to a large number of pupils at one time by persons with relatively
little training in test administration. Some group tests provide a single score
while others provide two or more scores based on measures of separate aspects
of mental ability.

Single-Score Tests. Scholastic aptitude tests which yield a single score are 
designed to measure the general mental ability of pupils. Such a variety of types 
of items are included in the test that no particular ability or skill receives undue 
emphasis in the total score. Thus, the specific aspects of mental ability (such as 
verbal and numerical reasoning) are blended together into one global measure 
of scholastic aptitude. 

Some single-score tests use a spiral-omnibus pattern in which different types 
of test items are mixed together and placed in increasing order of difficulty. 
These tests generally have one set of simple directions and a relatively short 
testing time, since there are no subtests. Examples of widely used tests of this 
type are the Otis Quick-Scoring Mental Ability Test and the Henmon-Nelson
Tests of Mental Ability (see Figure 12.1).


Figure 12.1. Sample items from the Henmon-Nelson Tests of Mental Ability, Form A,
Grades 3-6. Revised by Tom A. Lamke and M. J. Nelson. (Copyright 1957 by Houghton
Mifflin Company. Used by permission.)

Single-score tests using a series of short, separately timed subtests are also
widely used in schools. Typical tests of this type include the Kuhlmann-Anderson
Intelligence Tests, Sixth Edition, and the Kuhlmann-Finch Intelligence Tests.

A relatively new type of group test yielding a single score is the culture-fair
test. This test type was designed to measure general problem-solving ability free
from social-class bias. The Davis-Eells Games, which is subtitled the Davis-Eells
Test of General Intelligence or Problem Solving Ability, is a typical example
of a culture-fair test. The test items consist of pictures portraying problem
situations familiar to children from all urban cultural groups (see Figure 12.2).
The instructions, which are read to the pupils, are also based on vocabulary
common to all American urban children. Thus, reading skill is not a factor, and
the tasks are selected so that pupils from all social classes have an equal
opportunity to demonstrate their reasoning ability in familiar situations.
Although this type of test may have promise in identifying pupils from
impoverished home backgrounds who have good reasoning ability but are weak in
educational skills, there is conflicting evidence concerning its freedom from
social-class bias.9


The teacher reads: Now look at this top row of pictures. Look at the boys and
the gate. Each picture has a number on it—No. 1, No. 2, and No. 3. Which boy is
starting the best way to get over the gate? No. 1, or No. 2 boy, or No. 3 boy?
Draw a line on the right box.

Now look at the next picture; it is beside the one you just did. It shows a boy
and a girl waving their hands. Hold your finger on this picture—a boy and a girl
waving their hands. Look at the picture while I tell you about it. Which number
is right? Mark the right box.

No. 1 Box: They are waving at a boy.

No. 2 Box: They are waving at a girl.

No. 3 Box: We cannot tell from this picture whom they are waving to.

Be sure to mark a box.

Figure 12.2. Sample items from the Davis-Eells Games by Allison Davis and Kenneth
Eells. (Copyright 1952 by Harcourt, Brace & World, Inc. All rights reserved.
Reproduced by permission.)


9 N. E. Wallen, “Development and Application of Tests of General Mental Ability,”
Review of Educational Research, 32, 15-24, February, 1962.

Tests with Verbal and Nonverbal Scores. A number of group tests of
scholastic aptitude have been designed to yield separate verbal and nonverbal
or language and nonlanguage scores. In some of these tests the verbal and
nonverbal subtests have been printed in separate booklets so that they can be
administered separately. Typical of this type are the Lorge-Thorndike
Intelligence Tests. Other tests in this area combine the subtests in one booklet and
provide a total score as well as separate language and nonlanguage scores. A
widely used test of this type is the California Test of Mental Maturity. Sample
items are shown in Figure 12.3.

The obvious advantage of using both verbal and nonverbal scores is in
identifying learning potential at two different levels. The verbal score generally
provides the best prediction of school success because verbal ability plays such
a prominent role in learning school tasks. Thus, high verbal scores indicate the
presence of those abilities needed for immediate success in school work. In a
sense this is realized potential supported by all of the functional skills necessary
for effective performance. The nonverbal score, on the other hand, provides a
more satisfactory estimate of a pupil’s unrealized potential. For example, a
pupil with above average mental ability who is deficient in reading is apt to
have a low verbal score but an above average nonverbal score. Although his
chances of immediate success in school work are poor because of his reading
deficiency, his chances of future success are good if he is given appropriate
remedial work. Thus the nonverbal score serves as a check on the learning
potential of those with low verbal scores.

TEST 2 DIRECTIONS: The first three pictures in each row are of things which are
alike in some way. Decide how they are alike and then find the picture to the
right of the dotted line that is most like them and mark its number.

DIRECTIONS: Mark the number of the word that means the same or about the same
as the first word.

Figure 12.3. Sample items from the California Short-Form Test of Mental Maturity,
Level 3, Grades 7-8. (Copyright 1963 by California Test Bureau. Used by permission.)




Some group tests of scholastic aptitude provide separate verbal and quan- 
titative scores. These tests are designed to provide differential prediction of 
school success. They are based on the principle that tests of verbal ability are 
best for predicting achievement in courses in which verbal concepts are em- 
phasized and tests of quantitative ability are best for predicting success in 
courses stressing mathematical concepts. Tests of this type include the
Cooperative School and College Ability Test (SCAT) and the Kuhlmann-Anderson
Intelligence Tests, Seventh Edition. Items from SCAT are presented in Figure 12.4.

Select the missing word.

1 In order not to ( ) what he had to buy he repeated the list as he walked
to the store.
A take   B carry   C forget   D change   E lose

Choose the correct answer.

1 338,420 − 140,621
A 197,799   B 197,801   C 197,809   D 197,899   E None of these

Find the word closest in meaning to the capitalized word.

1 IRRITATE
A dislike   B uncover   C annoy   D authorize   E subdue

Choose the correct answer.

1 How many inches are there in 2 1/2 feet?
A 24   B 26   C 28   D 30   E 32

Figure 12.4. Sample items from the Cooperative School and College Ability Test (SCAT),
Level 2, Grades 10-12. (Copyright 1955 by Cooperative Test Division, Educational Testing
Service. Used by permission.)

Scholastic aptitude tests providing verbal and nonverbal, language and non- 
language, or verbal and quantitative scores must be interpreted with extreme 
caution. The separate scores usually correlate highly with each other because 
of the general mental ability factor common to both sets of scores. This means 
that differences between the two scores must be relatively large before they can 
be used for diagnostic purposes or differential prediction. A safe procedure is 
to consider the differences between scores as merely clues to be verified by other 
evidence. 
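
The caution above can be illustrated numerically. The sketch below uses the standard psychometric formula for the reliability of a difference between two scores with equal variances; the formula and the reliability and correlation values are assumptions added for illustration and are not taken from this chapter or from any test manual.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference between two scores.

    r_xx, r_yy: reliabilities of the two scores (e.g., verbal and nonverbal).
    r_xy: correlation between the two scores.
    Assumes both scores are expressed on scales with equal variances.
    """
    if not 0.0 <= r_xy < 1.0:
        raise ValueError("r_xy must be at least 0 and less than 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


if __name__ == "__main__":
    # Hypothetical values: both scores have reliability .90.
    for r_xy in (0.50, 0.70, 0.80):
        rel = difference_reliability(0.90, 0.90, r_xy)
        print(f"correlation {r_xy:.2f} -> difference reliability {rel:.2f}")
```

With these illustrative figures, the higher the correlation between the two scores, the less dependable the difference between them becomes, which is why small verbal-nonverbal differences are best treated only as clues.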

Multiscore Tests. Group tests have also been developed which measure a 
number of specific aptitudes that are relatively independent of each other. One 
of the best known of these tests is based on L. L. Thurstone’s factor analysis 
studies of intelligence in which he identified a number of separate mental abili- 
ties. This is the SRA Primary Mental Abilities (PMA) Test. The 1962 revision 
includes five separate batteries, encompassing all grade levels. It measures the
following mental abilities at the later elementary level: Verbal Meaning, Number
Facility, Spatial Relations, Reasoning, and Perceptual Speed. The reasoning test
is omitted at lower grade levels and the perceptual speed test is not used at
higher grade levels. Sample items from the PMA are presented in Figure 12.5.
One of the most widely used multiscore aptitude tests at the high school level
is the Differential Aptitude Tests (DAT). Since this battery of tests was
published primarily for guidance use, it includes measures of aptitude which go
beyond those included in the typical scholastic aptitude tests. The eight tests in
the battery are designed to measure the following aptitudes: Verbal Reasoning,
Numerical Ability, Abstract Reasoning, Space Relations, Mechanical Reasoning,
Clerical Speed and Accuracy, Language Usage: Spelling, and Language Usage:
Grammar. In addition to a separate score for each test, the Verbal Reasoning
and Numerical Ability scores can be combined to provide a general measure of
scholastic aptitude. Also, the Abstract Reasoning score provides a good measure
of nonverbal scholastic aptitude. As noted earlier, this type of measure serves
as a check on the learning potential of pupils with language handicaps.
Sample items for four of the eight areas tested by the DAT are presented
in Figure 12.6. These same four areas are also included in an aptitude test
designed for grades 6 to 9, called the Academic Promise Tests (APT). This is
like a junior high school form of the DAT, but does not include the vocationally
oriented sections on space relations, mechanical reasoning, and clerical aptitude.
These areas were not considered necessary in the APT, since vocational planning
is usually delayed until high school.

A major advantage of a multiscore test like the DAT is that it provides inde- 
pendent measures of aptitude in areas other than those required for strictly 
academic work. Consequently, a pupil with low scores on the typical scholastic 
aptitude measures (that is, Verbal, Numerical, and Abstract Reasoning) can 
obtain an indication of his strengths as well as his weaknesses. Knowing he has 
mechanical or clerical aptitude, for example, is much more helpful for
educational and vocational planning than merely knowing that he lacks aptitude for
learning the verbal and numerical concepts presented in the more academic
subjects.10

VERBAL MEANING: Indicate the word which means the same as BIG. Find the picture
of the dog.
NUMBER FACILITY: Find the missing number in the series. Joan earns 50¢ an hour
for baby-sitting. She baby-sat for two hours. How much did she earn altogether?
Add the column of figures and select the correct answer.
SPATIAL RELATIONS: The first drawing is one part of a square. Select the shape
which completes the square.
REASONING: Find the drawing that is different from the others.
PERCEPTUAL SPEED: Find the two drawings that are exactly alike.

Figure 12.5. Sample items. (Reprinted by permission of Science Research Associates,
Inc. from PMA Primary Mental Abilities for Grades 4-6 by Thelma Gwinn Thurstone.
© 1962, Thelma Gwinn Thurstone.)

ABSTRACT REASONING: Which “answer figure” is next in the series?
NUMERICAL ABILITY: Select the correct answer for each problem.
VERBAL REASONING: Fill the blanks with the pair of words which make the sentence
true or sensible. (Sample: “. . . is to night as breakfast is to . . .”; the
correct pair is supper — morning.)
LANGUAGE USAGE: GRAMMAR: Which of the lettered parts of each sentence contains
errors in grammar, punctuation, or spelling? (Sample: “Ain’t we / going to /
the office / next week?”)

Figure 12.6. Sample items from the Differential Aptitude Tests (DAT). (Reproduced by
permission. Copyright 1947, © 1961, 1962, The Psychological Corporation, New York, N.Y.
All rights reserved.)


Individual Tests of Scholastic Aptitude 


For the majority of pupils, group tests provide a satisfactory estimate of 
scholastic aptitude. In certain instances, however, it is desirable to obtain scores 
from an individual test. Since the individual scholastic aptitude test is admin- 
istered to one pupil at a time, it is possible to control more carefully such 
factors as motivation and to assess more accurately the extent to which “dis- 
abling behaviors” are influencing the score. The influence of reading skill is 
deemphasized because the tasks are presented orally to the pupil. In addition, 
clinical insights concerning the pupil’s method of attacking problems and his 
persistence in solving them are more readily obtained with individual testing. 
These advantages make the individual test especially useful for testing young 
children and for retesting pupils whose scores on group tests are questionable. 
This includes, as a minimum, all extremely low scores and those which differ 
considerably from the teacher’s estimate of ability. Where educational decisions 
are to have far-reaching influences, such as the placement of pupils in special 
classes for the mentally handicapped, the more dependable individual measure 
of mental ability is also preferred. 

The two most highly regarded individual tests for use with school children 
are the Revised Stanford-Binet Intelligence Scale, 1960 Revision, and the 
Wechsler Intelligence Scale for Children (WISC). For pupils who are age 16 
or older, the Wechsler Adult Intelligence Scale (WAIS) replaces the WISC. 

The 1960 revision of the Stanford-Binet is called Form L-M because it incor- 
porates the best test items from the L and M forms of the 1937 edition. It consists 
of a series of tests arranged by age levels. The tests begin at the two-year- 
old level and continue on up to the superior adult level. At the lowest age levels 
the tests require the child to identify objects, identify parts of the body, identify 
pictures, obey simple commands, and the like. At later ages, the tests include 
a variety of tasks, most of which place heavy emphasis on verbal reasoning 
ability. 

The administration of the Stanford-Binet, like other individual tests, requires 
a specially trained examiner, who meets with the child in a counseling-type 
setting. After establishing rapport with the child, the examiner begins at the 
level at which the child can pass all of the tests. From there, he continues to 
administer tests at successively higher levels until he reaches a level at which 
the child fails all tests. The mental age of the child is then determined by adding 
to the basal age (age level at which he passes all tests) two months credit for 
each test he passes at the higher levels. When the child’s mental age has been 
computed, it can be converted to a deviation IQ by means of tables presented 
in the test manual. 
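
The scoring rule just described can be sketched as a short calculation. This is only an illustration of the arithmetic stated above (basal age plus two months of credit per test passed at higher levels); the function names and the sample pupil are hypothetical, and the conversion to a deviation IQ is not shown because it depends on tables in the test manual.

```python
def mental_age_months(basal_age_years: int,
                      tests_passed_above_basal: int,
                      months_per_test: int = 2) -> int:
    """Mental age in months: basal age plus credit for each test passed
    at levels above the basal age (two months per test, as stated in the text)."""
    if basal_age_years < 0 or tests_passed_above_basal < 0:
        raise ValueError("ages and counts must be non-negative")
    return basal_age_years * 12 + tests_passed_above_basal * months_per_test


def format_age(months: int) -> str:
    """Express an age in months as years and months."""
    years, remainder = divmod(months, 12)
    return f"{years} years, {remainder} months"


if __name__ == "__main__":
    # Hypothetical pupil: passes all tests at the 8-year level (basal age)
    # and nine additional tests at higher levels.
    ma = mental_age_months(basal_age_years=8, tests_passed_above_basal=9)
    print(format_age(ma))
```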


10 For a brief description of the group scholastic aptitude tests listed in this section, see
the appendix. For other tests in this area, consult Buros’ Tests in Print (see footnote 2,
above).




Unlike the Stanford-Binet, which is organized by age levels, the Wechsler
scales (WISC and WAIS) are arranged by subtests. There are eleven
scored subtests: six of them are grouped together to form a verbal scale, and
five of them form a performance scale. Like the Stanford-Binet, the tests must
be administered by a specially trained examiner, and the scores are expressed
in terms of deviation IQ’s. The Stanford-Binet and Wechsler scales are about
equally useful for children age seven or above, but the Stanford-Binet is superior
below that age level.11

From the viewpoint of the teacher, who will most likely use the results of
these tests, rather than administer them, the most important difference between
the scales is in the types of scores provided. The Stanford-Binet scale provides a
single score representing a measure of general mental ability while the Wechsler
scales provide both verbal and performance scores.12


Reading Readiness Tests 


A special type of scholastic aptitude test is the reading readiness test used at 
the kindergarten and first-grade level. As indicated by the name, the test is 
designed to determine the readiness of pupils for beginning reading. Although 
the materials included in the test are similar to those found in general scholastic 
aptitude tests used at that level, they are more directly concerned with skills 
and abilities of immediate value in learning to read. Items which require pupils 
to identify, differentiate, and match various letters, words, figures, and sounds
are most commonly included.

Typical of the more widely used tests in this area are: Gates Reading Readiness
Test, Harrison-Stroud Reading Readiness Profiles, Lee-Clark Reading
Readiness Test, Metropolitan Readiness Tests, and Murphy-Durrell Diagnostic
Reading Readiness Test.13

Like other tests of scholastic aptitude, and tests of achievement, reading
readiness tests include a variety of types of test content and each provides slightly
different information concerning the readiness of pupils. Some of the tests,
for example, provide a measure of three rather general abilities while others
provide as many as seven scores on specific readiness skills. Before selecting a
given test, the teacher should be certain that the test provides the particular
type of information which is most useful in his reading program.

11 R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and
Education (New York: John Wiley & Sons, 1961).

12 Readers who would like more detailed descriptions of these individual tests of
intelligence should consult Cronbach (see footnote 6, above) or Thorndike and Hagen (see
footnote 11, above).

13 For a brief description of these tests, see the appendix. For other readiness tests, see
Buros’ Tests in Print (see footnote 2, above).


Cautions in Interpreting Scholastic Aptitude Test Scores


Before pointing out some of the cautions to be observed in interpreting the
scores derived from scholastic aptitude tests, it is necessary to describe briefly
the most frequently used scores. This description will be very brief,
since Chapter 14 is devoted to the various types of derived scores used in
standardized testing.

Mental Age and Intelligence Quotient. Two of the most common types
of scores used with scholastic aptitude tests are the mental age (MA) score and
the intelligence quotient (IQ). The mental age score indicates a pupil’s level of
mental development. A pupil with a mental age of 12, for example, has mental
ability equivalent to the average twelve-year-old. The intelligence quotient refers
to a pupil’s rate of mental development. The conventional ratio IQ, used in most
of the older intelligence tests, was calculated by the following formula:


$$\frac{\text{Mental Age (MA)}}{\text{Chronological Age (CA)}} \times 100 = \text{IQ}$$


Thus, if a pupil with a chronological age (CA) of 10 had a mental age (MA) 
of 12 his IQ would be 120. This would indicate that his mental development was 
proceeding at a faster rate than that of the average child. It will be noted from 
the formula, that when a pupil’s mental age (MA) and chronological age (CA) 
are equal his IQ will be 100, and if his mental age (MA) is lower than his
chronological age (CA) his IQ will be less than 100. The IQ has been highly
regarded as an index of intelligence because it supposedly indicates the same
relative ability at different age levels and remains fairly constant for any given
individual.
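
The ratio IQ formula can be checked with a few lines of code. This is a minimal sketch of the arithmetic only; the function name is hypothetical, and the three cases simply reproduce the situations described above (MA above, equal to, and below CA).

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Conventional ratio IQ: (MA / CA) x 100.

    Both ages must be in the same unit (years or months).
    """
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age / chronological_age


if __name__ == "__main__":
    # (MA, CA) pairs in years: faster than average, average, slower than average.
    for ma, ca in ((12, 10), (10, 10), (8, 10)):
        print(f"MA {ma}, CA {ca}: ratio IQ = {ratio_iq(ma, ca):.0f}")
```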


Although the ratio IQ, based on the above formula, is still used in some
mental ability tests, it has been replaced in the newer tests by the deviation IQ.
This is an IQ based on standard scores with a mean (average score) of 100
and a standard deviation of 15 or 16. The deviation IQ can be interpreted in
much the same way as the ratio IQ, but it has some definite advantages. These
will be considered in Chapter 14, where the relative merits of standard scores
are discussed.

Cautions in Interpreting IQ Scores. Many of the misinterpretations and
misuses of IQ scores have been due to the false belief that a person’s IQ, like
his blood type, is a fixed quantity which does not vary. Since the IQ is a score
derived from tests which include a variety of types of mental tasks, which have
less than perfect reliability, and which measure such a variable quality as
human performance, this expectation is extremely unrealistic. Variations in IQ
from one time to another can be expected, even under the most ideal testing
conditions. Proper use of the IQ score requires that this variation be recognized
and taken into account rather than ignored.

Assuming that the basic conditions of testing, such as equal opportunity for
development and maximum motivation, have been met, we can still expect the
following types and amounts of variation in the IQ score.


1. An IQ from the same test can be expected to vary from 5 to 10 points on 
the basis of the standard error alone. Thus, an IQ of 105 can be most safely 
interpreted as a band of IQ scores ranging from 95 to 115. 

2. When IQ scores from different tests are compared, the IQ can be expected
to vary several points for scores in average and below-average ranges and 10
or more points for high IQs. Since each test measures slightly different aspects
of mental ability and each is standardized on a different population, the IQs
are not directly comparable. In interpreting an IQ score, it is desirable to know
the test from which it was derived.

3. Greater variation in IQ scores can be expected for elementary school pupils
than for high school pupils. This can be accounted for largely by the fact that
abilities tend to be less stable during their formation.14 Where group tests are
used, variation is still greater because the control of conditions necessary for
maximum performance is seldom possible.
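
The band interpretation in the first point of the list above can be sketched in code. The 10-point margin follows the example given there (an IQ of 105 read as 95 to 115); the function names and the second pupil's score are hypothetical, and the overlap check is an added illustration rather than a procedure from the text.

```python
from typing import Tuple


def iq_band(obtained_iq: int, margin: int = 10) -> Tuple[int, int]:
    """Return the (low, high) band around an obtained IQ score."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return obtained_iq - margin, obtained_iq + margin


def bands_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True when two score bands share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    first = iq_band(105)    # the example from the text
    second = iq_band(112)   # hypothetical second score
    print("Band for 105:", first)
    print("Bands overlap:", bands_overlap(first, second))
```

Treating each score as a band makes it harder to read a real difference in ability into two scores whose bands overlap.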


The variations in IQ discussed above are normal variations that can be 
expected under ideal testing conditions. Although such variation might be dis- 
turbing to those who expect the IQ to be perfectly constant, it should not be 
upsetting to teachers who want to use IQ scores for practical purposes.
Knowing a pupil’s IQ is somewhere between 85 and 95 or between 115 and 125 is
sufficiently accurate to provide an extremely useful appraisal of his ability to
learn.

The greatest problems in interpreting and using IQ scores are not caused by
normal variations in the IQ. These we can estimate fairly accurately and make
allowances for. The major problems arise when IQ scores contain an
indeterminate amount of error because the basic conditions of aptitude testing have not
been met. In general, IQ scores are less dependable for the following types of
pupils:

1. Those whose home environment is sufficiently barren to prevent full
opportunity to learn the types of tasks included in the test.

2. Those who are little motivated by school-type tasks. 

3. Those who are weak in reading skills or have a language handicap. 

4. Those who have poor emotional adjustment. 

Radical changes in IQ have been noted where environmental deprivation was
removed, where remedial work overcame deficiencies in educational skills, and
where emotional problems were remedied. Although such changes can be
interpreted in terms of fulfilling the necessary conditions for adequate testing, the
extent to which an individual IQ score has been influenced by such factors can
seldom be fully determined. Consequently, the only safe procedure is to
consider a single IQ score as a highly tentative estimate which needs verification
by other test results and by classroom observation. As the results of several IQ
tests accumulate and other evidence of ability is added, a fairly dependable
estimate of scholastic aptitude will emerge.


14 L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).


SUMMARY


Standardized tests are characterized by items of high quality, rigid directions
for administering and scoring, and norms for interpreting the results. The most
common types of standardized tests used in teaching are achievement tests and
scholastic aptitude tests.

Standardized achievement tests complement rather than replace teachers’ 
informal classroom tests. They are especially useful for measuring general edu- 
cational development, determining pupil progress from one year to the next, 
grouping pupils, diagnosing learning difficulties, and comparing achievement 
with scholastic aptitude. They are of little value for measuring learning out- 
comes unique to a particular course, day-to-day progress of pupils, and knowl- 
edge of current developments in rapidly changing fields. These latter purposes 
are more effectively served by informal classroom tests. 

Achievement test batteries are widely used at the elementary school level. 
They cover the basic skills (i.e., reading, language, and arithmetic), and some
batteries also include sections on work-study skills, science, and social studies. 
Test batteries are less widely used at the high school level because of the dif- 
ficulty of identifying a common core of content. Batteries at this level are 
confined to the basic skills, to content included in the basic high school subjects, 
or to measures of general educational development. The main advantage of a 
test battery is that a pupil’s strengths and weaknesses in different areas can be 
determined. The complete test battery seldom fits the instructional objectives 
of the school, however, and this must be taken into account when interpreting 
the results. 

In addition to achievement test batteries, there are a number of separate 
subject-matter tests, survey reading tests, and diagnostic tests. While these can 
be more readily adapted to instructional objectives than complete batteries, they 
must be selected carefully in terms of the specific uses to be made of the results. 
Individual tests within each of the above areas vary considerably in the pupil 
characteristics they measure. 

Scholastic aptitude tests measure performance based on learned abilities 
from which potential for future learning is inferred. They differ from
achievement tests in that they measure learned abilities of a more general nature (e.g.,
verbal and numerical reasoning), which are less dependent on specific school 
experiences. Some scholastic aptitude tests can be administered to groups of 
pupils while others must be administered to one individual at a time. 

Group tests of scholastic aptitude may yield a single score, separate verbal 
and nonverbal scores, or multiple scores based on a series of specific aptitudes. 
The single-score test is designed to measure general mental ability only. In tests 
using verbal and nonverbal scores, the nonverbal score serves as a check on 
the learning potential of poor readers. Multiscore tests measure a variety of 
separate aptitudes which are predictive of success in academic and nonacademic
areas. 

Individual tests of scholastic aptitude deemphasize reading skill and provide 
more carefully controlled testing conditions. Thus the individual test is espe- 
cially valuable for testing young children and for checking on questionable 
scores obtained with group tests. Special training is required to administer 
individual tests but the scores are interpreted in the same manner as those
obtained from group tests. Reading readiness tests are a special type of scholastic
aptitude test designed to determine the readiness of pupils for beginning reading.

Proper interpretation of scholastic aptitude test scores requires recognition
of normal variation in the scores as well as of factors causing extreme variation.
A safe procedure is to view test scores as highly tentative estimates to be verified
by other available evidence.


SUGGESTIONS FOR FURTHER READING 


Anastasi, Anne. Psychological Testing. 2nd edition, New York: Macmillan, 1961. Chapters
8-13. Intelligence tests and multiple aptitude batteries. Chapters 16-17. Achievement
tests.

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper & Row, 1960.
Chapters 7-12. Mental ability tests and multiple aptitude batteries. Chapter 13: “Proficiency
Tests.”

Davis, F. B. Educational Measurements and Their Interpretation. Belmont, California:
Wadsworth Publishing Co., 1964. Chapter 5: “Measurement of Achievement.” Chapter 6:
“Measurement of Intelligence and Aptitude.”

Findley, W. G. (ed.) The Impact and Improvement of School Testing Programs, Sixty-second
Yearbook of the National Society for the Study of Education, Part II. Chicago: The
University of Chicago Press, 1963. Chapter 8: Engelhart, M. D., and Beck, J. M., “The
Improvement of Tests.”

Goslin, D. A. The Search for Ability: Standardized Testing in Social Perspective. New York:
Russell Sage Foundation, 1963. Chapter 4: “Testing in Education.” Chapter 6: “What
Ability Tests Measure.”

Harris, C. W. “Intelligence,” Encyclopedia of Educational Research, 3rd edition, New York:
Macmillan, 1960. Pages 715-717.

Nunnally, J. C. Educational Measurement and Evaluation. New York: McGraw-Hill, 1964.
Chapter 11: “Factors of Intellect.”

Stroud, J. B. “The Intelligence Test in School Use: Some Persistent Issues,” Journal of
Educational Psychology, 48, 77-85, 1957.

Torrance, E. P. Guiding Creative Talent. Englewood Cliffs, New Jersey: Prentice-Hall, 1962.
Chapter 2: “Assessing the Creative Thinking Abilities.” Chapter 3: “The Minnesota
Tests of Creative Thinking.”

Thorndike, R. L., and E. Hagen. Measurement and Evaluation in Psychology and
Education. New York: John Wiley & Sons, 1961. Chapters 9-11. Intelligence, aptitude,
and achievement tests.


Test Bulletins

Good, W. E. Misconceptions about Intelligence Testing. Test Service Bulletin, No. 79, New
York: Harcourt, Brace & World, 1954.

Seashore, H. G. The Identification of the Gifted. Test Service Bulletin, No. 55, New York:
The Psychological Corporation, 1963.

Wesman, A. G. Aptitude, Intelligence, and Achievement. Test Service Bulletin, No. 51, New
York: The Psychological Corporation, 1956.


Chapter 13
selecting and using standardized tests


There are numerous standardized tests from which to choose. . . . Your
task is to locate, evaluate, and select those tests which best suit your
needs. . . . A systematic procedure facilitates this process. . . . Careful
administering and scoring are necessary for valid results. . . . There are many
uses of standardized tests. . . . These are best served by a school-wide
testing program.


The use of standardized tests in school has increased rapidly over the past
decade. It has been estimated that more than 120 million standardized tests are
now being administered annually.¹ This is approximately three tests for every
pupil enrolled in school. If these tests are to make their maximum contribution
to the development and guidance of individual pupils they must be selected and
used with the utmost of care.

Some standardized tests are selected by individual teachers, but more commonly
the tests are selected in accordance with the school testing program. In
either case, the teacher is usually active in the selection process since
testing programs are cooperatively planned by teachers, guidance workers, and
administrators. In large school systems where it is necessary to use committees
for this purpose, the individual teacher still has a voice in the process through
departmental and general staff meetings.

It is important that teachers participate actively in the selection of standardized
tests. Their participation provides greater assurance that tests are in
harmony with instructional objectives of the school and that the results will
serve the various instructional uses for which they are intended. Although standardized
tests can serve a variety of administrative and guidance functions, of
central concern to any testing program is the effective use of test results in
evaluating and improving the educational development of pupils. It is for this
reason that teachers should play a key role in selecting the tests to be used in
the school testing program and, consequently, become intimately familiar with
the procedures of test selection.

1 F. B. Davis, “Testing and the Use of Test Results,” Review of Educational Research,
32, 5-14, February, 1962.

Teachers must also know the procedures for administering and scoring standardized
tests. In some schools they participate directly in these functions while in
others special personnel are used. In either case, however, teacher understanding
of the procedures contributes to more effective interpretation and use of test
results.


SOURCES OF INFORMATION ABOUT SPECIFIC TESTS

There are a number of sources of information concerning specific standardized
tests. These will be presented in the approximate order in which they should
be consulted when searching for information to further the selection of tests.


Buros’ Guides 
The most useful single guide for locating the published tests available in a
particular area is Buros’ Tests in Print.² This is a comprehensive bibliography
of tests used in education, psychology, and industry. An attempt was made to
include a bibliographic entry for all tests printed in the English language as of
June 1, 1961. Of the 2,138 tests still in print, 1,875 of them were published in
the United States. The number and percentage of tests published in the United
States in each major area of measurement are presented in Table 13.1.

The number of tests available in any given area makes the selection of the
proper test seem like an almost insurmountable task. However, many of the
tests can be disregarded quickly because they are recommended for experimental
or research purposes only, or the publication dates indicate they are too old
to be very useful, or the bibliographic information makes clear that they are
inappropriate for a particular use. Each entry in Tests in Print includes the
following information, where relevant, to aid this original screening process.


1. Test title.
2. Grade levels for each test booklet.
3. Publication dates.
4. Special comments on test.
5. Number and type of scores provided.
6. Authors.
7. Publisher.
8. Reference to reviews of the test in Buros’ Mental Measurements Yearbooks.


Since the bibliographic entries presented in Tests in Print consist of nonevaluative
descriptive information concerning each test, the reference to the
Mental Measurements Yearbooks³ is especially valuable. In addition to the kind
of information presented in Tests in Print, the Yearbooks include such items
as cost of the test, number of forms available, administration time, and whether
norms are available. Of greatest importance, however, are the critical reviews
they contain. Every test is reviewed by two or more specialists qualified by
training and experience to evaluate the tests. Since the reviewers are professional
persons with no “ax to grind,” they do not hesitate to point out test weaknesses
as well as any exaggerated claims made by test publishers. In addition, of course,
they indicate strengths of a test and the uses for which it is best suited. Following
the reviews there is usually a bibliography of journal articles pertaining to
the test. Still further evaluative information can be obtained by referring to
these articles. The Yearbooks are published periodically but not on a definite
schedule. A new one should appear in 1965.

2 O. K. Buros, Tests in Print (Highland Park, New Jersey: Gryphon Press, 1961).

Table 13.1

IN PRINT TESTS PUBLISHED IN THE UNITED STATES
BY MAJOR CLASSIFICATIONS*

Classification               Number   Percentage
Character and Personality       290         15.5
Vocations                       259         13.8
Miscellaneous                   219         11.7
English                         171          9.1
Mathematics                     154          8.2
Intelligence                    153          8.2
Reading                         137          7.3
Social Studies                  112          5.9
Science                         103          5.5
Foreign Languages                90          4.8
Business Education               52          2.8
Sensory-Motor                    51          2.7
Achievement Batteries            41          2.2
Fine Arts                        28          1.5
Multi-aptitude Batteries         15          0.8
Total                         1,875        100.0

* Adapted from Buros, Oscar K., Tests in Print, page xx, copyright © 1961 by Oscar
Krisen Buros, published by The Gryphon Press (Highland Park, New Jersey). Used by
permission of the publisher.
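The percentages in Table 13.1 can be rechecked from the counts alone. The following is a minimal sketch in Python; the variable names are illustrative assumptions, and only the counts are taken from the table.

```python
# Counts transcribed from Table 13.1 (tests in print, published in the U.S.).
counts = {
    "Character and Personality": 290,
    "Vocations": 259,
    "Miscellaneous": 219,
    "English": 171,
    "Mathematics": 154,
    "Intelligence": 153,
    "Reading": 137,
    "Social Studies": 112,
    "Science": 103,
    "Foreign Languages": 90,
    "Business Education": 52,
    "Sensory-Motor": 51,
    "Achievement Batteries": 41,
    "Fine Arts": 28,
    "Multi-aptitude Batteries": 15,
}

total = sum(counts.values())

# Each classification as a percentage of the total, rounded to one decimal.
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in percentages.items():
    print(f"{name:<26}{counts[name]:>6}{pct:>8.1f}")
print(f"{'Total':<26}{total:>6}")
```

Computed this way, every row agrees with the printed table except Social Studies, which works out to 6.0 rather than 5.9. The printed figure may have been adjusted so that the column totals exactly 100.0, since percentages rounded independently need not sum to 100.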


As aids for locating tests for a particular use, Tests in Print provides an excellent
guide to the tests available and the Mental Measurements Yearbooks provide
the information needed to evaluate the tests from technical and practical
standpoints. The cross references to the Yearbooks in Tests in Print make the
screening procedures of test selection a simple two-step process. Together these
volumes are of such value for selecting and evaluating tests that it is hard to
imagine a school library without them.

3 O. K. Buros, The Fifth Mental Measurements Yearbook (Highland Park, New Jersey:
Gryphon Press, 1959). O. K. Buros, The Fourth Mental Measurements Yearbook (Highland
Park, New Jersey: Gryphon Press, 1953). O. K. Buros, The Third Mental Measurements
Yearbook (New Brunswick, New Jersey: Rutgers University Press, 1949). O. K. Buros, The
1940 Mental Measurements Yearbook (Highland Park, New Jersey: Gryphon Press, 1941).
O. K. Buros, The 1938 Mental Measurements Yearbook (New Brunswick, New Jersey: Rutgers
University Press, 1938).


Test Publishers’ Catalogues 


The most recent information concerning tests available for school use can be
obtained from test publishers’ catalogues. These usually contain brief descriptions
of each test, including possible uses of the test, cost, administration time,
and similar information. If a publisher’s claims for his tests are checked by
independent reviews such as those presented in the Mental Measurements Yearbooks
and professional journal articles, test catalogues provide a good source of
information. They are especially useful for locating new tests and recent editions
of earlier tests. A list of test publishers, who will send catalogues upon request,
is included in the appendix.


Review of Educational Research 


An extremely helpful guide for locating research articles concerned with the
use of tests in schools is the Review of Educational Research.⁴ Approximately
every three years one issue of this journal is devoted to a summary of research
regarding educational and psychological testing. Separate chapters are usually
devoted to achievement tests, general mental ability tests, methods of appraising
personality, tests of special aptitudes, the use of test results, and statistical
methods. In addition to summarizing the research in these areas, extensive
bibliographies are provided. The issues of the Review of Educational Research
devoted to testing provide easy access to the research data pertinent to particular
standardized tests.


Professional Journals 
Up-to-date information on testing can also be obtained from professional
journals concerned mainly with testing problems. One of the most useful is
Educational and Psychological Measurement. This journal includes test reviews
and reviews of new books in measurement, as well as articles on the development
and use of tests. Other journals providing test reviews are the Journal of Consulting
Psychology and the Personnel and Guidance Journal. Numerous other
professional journals also include occasional articles on testing that contain
information of an evaluative nature. These articles can be most easily located
through the use of such bibliographic sources as the Education Index and Psychological
Abstracts. Pertinent references to testing may be located in these
guides by looking under such headings as achievement testing, educational
measurement, evaluation, intelligence tests, psychological tests, testing programs,
and tests and scales.

4 E. F. Gardner, “Educational and Psychological Testing,” Review of Educational Research,
32, No. 1, February, 1962.


STEPS IN SELECTING STANDARDIZED TESTS 


Standardized tests play such a vital role in the school program that they should 
be selected with the utmost of care. Tests which are selected hastily or casually 
seldom provide adequate or appropriate information on which to base educa- 
tional decisions. In addition, such tests frequently have an undesirable influence 
on the school curriculum because they usually are not in complete harmony with 
the instructional program of the school. These and other pitfalls can be avoided 
by using a systematic selection procedure which takes into account the educa- 
tional objectives of the school, the role of standardized tests in the total evalua- 
tion program, and the technical qualities most desired in educational tests. The 
following sequence of steps provides such a systematic procedure. 


Defining Needs 


Before an intelligent selection can be made from among the great variety of
standardized tests available in a particular area, it is necessary to define
specifically the type of information being sought through testing. In selecting
achievement tests, for example, it is insufficient to search for a test to “evaluate
achievement in social studies,” to “diagnose strengths and weaknesses i[...]
measure somewhat d[...] knowledge, skill, and understanding, [...] for grouping
pupils, for vocational planning, [...] and mathematics courses. Each function [...]

[...] for example, if specific learning outcomes in the knowledge area were
already being adequately measured by informal classroom tests. Similarly, if a
scholastic aptitude test is to be used at a particular grade level for grouping
purposes only, it might be desirable to group pupils on the basis of achievement
tests and to replace the aptitude test with a diagnostic reading test.
Such decisions can be made only when the need for standardized
tests is viewed in terms of the general evaluation program.

Factors in the school situation also help narrow the choice. If, for
example, the school lacks a person with the training and experience required
to administer individual tests, only group tests need be considered. If the tests
are to be administered by teachers who are relatively inexperienced in test
administration, those with simple directions are to be favored. If the same type
of battery is desired for both the elementary and high school levels,
only those batteries with tests at all grade levels need be examined. Considerations
such as these provide additional criteria concerning the types of tests to
[...]

Locating Suitable Tests 

With a general outline of the types of standardized tests desired, a list of
possible tests can be compiled from Buros’ Tests in Print and test publishers’
catalogues. Though both sources will provide sufficient information to determine
which tests should be considered further, neither provides the type of information
needed to evaluate the quality of the test.

The list of possible tests to consider can be reduced to relatively few in each
area by consulting the critical reviews, described earlier, in Buros’ Mental Measurements
Yearbooks. These reviews are sufficiently detailed to weed out those
tests which are clearly inappropriate or which have glaring technical weaknesses.
Further information of an evaluative nature can, of course, be obtained
from other sources such as those also described in an earlier section.
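The two-step screening described above can be pictured as a simple filter. The sketch below is only an illustration in Python: the record layout, the function names, the test titles, and the cutoff year are all hypothetical, and the second step merely lists which reviews to read, since the judgment itself cannot be automated.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Hypothetical record of the facts a bibliographic entry supplies."""
    title: str
    latest_publication_year: int
    grade_levels: range        # grades covered by the test booklets
    research_only: bool        # recommended for experimental use only
    has_critical_review: bool  # entry cross-references a Yearbook review


def first_screen(entries, needed_grade, oldest_acceptable_year):
    """Step 1: drop tests that the bibliographic entry alone rules out."""
    return [
        e for e in entries
        if not e.research_only
        and e.latest_publication_year >= oldest_acceptable_year
        and needed_grade in e.grade_levels
    ]


def reviews_to_read(entries):
    """Step 2: titles whose critical reviews should now be consulted."""
    return [e.title for e in entries if e.has_critical_review]


# Invented entries for illustration only.
candidates = [
    CatalogEntry("Reading Test A", 1958, range(4, 9), False, True),
    CatalogEntry("Reading Test B", 1931, range(4, 9), False, True),
    CatalogEntry("Reading Test C", 1960, range(9, 13), False, False),
    CatalogEntry("Reading Test D", 1959, range(4, 7), True, False),
    CatalogEntry("Reading Test E", 1957, range(3, 7), False, False),
]

shortlist = first_screen(candidates, needed_grade=6, oldest_acceptable_year=1950)
print([e.title for e in shortlist])
print(reviews_to_read(shortlist))
```

A test that survives the first screen but has no critical review, like the invented Test E here, would have to be evaluated from the other sources described earlier.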


Obtaining Specimen Sets 

The next step of the selection procedure is to obtain specimen sets so that
test manuals and test items themselves can be carefully evaluated. Test publishers
generally provide specimen sets for each test they publish. These can
be purchased at relatively low cost and include a test manual, a test booklet,
and scoring keys. A majority of universities, colleges, and large school systems
maintain a file of such specimen sets. If such a file is unavailable, however, the
sets can be ordered from test publishers’ catalogues.


Reviewing the Test Materials 


The test manual (sometimes accompanied by related aids) usually provides
the most complete information for judging the appropriateness and the technical
qualities of a test. A good test manual includes the following types of information:


1. Uses for which the test is recommended. 

2. Qualifications needed to administer and interpret the test. 

3. Validity: Evidence of validity for each recommended use. 

4. Reliability: Evidence of reliability for recommended uses and an indi- 
cation of the equivalence of any equivalent forms provided. 

5. Clear directions for administering and scoring the test. 


6. Adequate norms, including a description of the procedures used in obtain- 
ing them. 


In addition, some test manuals (or supplements to the manuals) provide suggestions
and guides for interpreting and using the results. These are especially
helpful for determining the functions for which the test is best suited.

In reviewing test manuals in the above areas, the main thing to look for is
evidence. General statements about validity, reliability, or the adequacy of
norms should be disregarded unless they are supported by more detailed descriptions
of the procedures used and by statistical evidence of the type discussed
in Chapters 4 and 5. In fact, unsupported claims in these areas are sufficient
cause to eliminate a test from further consideration.

Before making a final selection, it is also highly desirable to study the individual
test items carefully. The best method of doing this is to attempt to
answer each item, as if taking the test. In the case of achievement tests, it is
also helpful to classify the items by means of a previously prepared table of
specifications. Although time consuming, there is no better means of determining
the extent to which a test is appropriate for measuring those knowledges,
skills, and understandings emphasized in the instructional program.


Making the Final Selection 


The final choice of tests requires a careful appraisal of the strengths and weaknesses
of each test in view of the intended uses of the results. This is a professional
judgment which depends as much upon the goals of the school in
which the tests will be used and the types of decisions for which information
is needed, as upon the technical qualities of the tests themselves. Turning, at
this juncture, to test publishers, test experts, or anyone else who is unfamiliar
with the particular testing needs of the school is no longer helpful. The final
decision must be made by persons intimately familiar with the local school
situation.


Some teachers attempt to put the final selection on an objective basis by
assigning a given number of points to validity, reliability, ease of scoring, and
each of the other desired test characteristics. This practice is not recommended
since the value of any given characteristic depends on the presence or absence
of other qualities. For example, if a test is inappropriate for a given use, its
reliability and ease of scoring are of no particular importance. Likewise, if two
tests appear equally appropriate but one has very low reliability, the other
qualities of the test become relatively insignificant. Rather than a numerical
comparison of test qualities, what is needed is a logical analysis and comparison
of the tests in light of the particular needs of the school program. Although this
is much more demanding, it provides greater assurance that the most appropriate
tests will be selected.
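The weakness of a point system can be seen in a small illustration. The Python sketch below uses invented ratings for two invented tests; none of the numbers come from the text, and the fixed "floor" is only a crude stand-in for the logical analysis recommended above, not a substitute for it.

```python
# Hypothetical ratings on a 0-10 scale for two hypothetical tests.
ratings = {
    "Test X": {"appropriateness": 2, "reliability": 9, "ease_of_scoring": 9},
    "Test Y": {"appropriateness": 8, "reliability": 7, "ease_of_scoring": 4},
}


def point_total(r):
    """Simple sum of points, the practice the text advises against."""
    return sum(r.values())


def passes_essentials(r, floor=5):
    """Reject a test outright when an essential quality is too weak."""
    return r["appropriateness"] >= floor and r["reliability"] >= floor


by_points = max(ratings, key=lambda name: point_total(ratings[name]))
acceptable = [name for name, r in ratings.items() if passes_essentials(r)]

print(by_points)    # the test favored by the point total
print(acceptable)   # the tests that survive the essential checks
```

Here the point total favors Test X, because its convenience in scoring offsets its poor fit to the intended use, while the check on essentials leaves only Test Y. This is the kind of compensation among qualities that makes a simple sum misleading.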


Using a Test Evaluation Form 

The entire process of test selection will be simplified if a standard form is
used when gathering information about specific tests. The use of such a form
provides a convenient method of recording data, it increases the likelihood that
pertinent information is not overlooked, and it makes possible a summary comparison
of the advantages and limitations of each test. Although minor adaptations
may be desirable to fit a particular situation, the following types of
information are typically included in a test evaluation form:


Identifying Data: 


1. Title of test. 

2. Authors. 

3. Publisher. 

4. Date of publication. 


General Information: 


5. Nature and purpose of test. 

6. Grade or age levels covered. 

7. Scores available.

8. Method of scoring (hand or machine). 
9. Administration time. 

10. Forms available. 

11. Cost (booklet and answer sheet). 


Technical Features: 


12. Validity: Type of validity and nature of evidence (content, construct,
predictive, and concurrent).

13. Reliability: Type of reliability and nature of evidence (stability, internal
consistency, and equivalence of forms).

14. Norms: Type, adequacy, and appropriateness to local situation.


Practical Features: 


15. Ease of administration (procedure and timing). 


16. Ease of scoring. 
17. Ease of interpretation. 


18. Adequacy of test manual and accessory materials. 


General Evaluation: 


19. Comments of reviewers. 


20. Summary of advantages and limitations for local use.




Although the test evaluation form provides a useful summary of information 
concerning tests, it must be reemphasized that no test should be selected on the 
basis of this information alone. How well a test fits the particular uses for which 
it is being considered is always the main consideration and there is no substi- 
tute for studying the actual test materials to determine this relevance. 
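For schools that keep such records in a computer file, the twenty entries of the form could be held in a single record. The following Python sketch is one possible arrangement; the class name, field names, and the sample test title are assumptions made for illustration and are not prescribed by the form itself.

```python
from dataclasses import dataclass, fields


@dataclass
class EvaluationForm:
    # Identifying data
    title: str = ""
    authors: str = ""
    publisher: str = ""
    publication_date: str = ""
    # General information
    nature_and_purpose: str = ""
    grade_or_age_levels: str = ""
    scores_available: str = ""
    scoring_method: str = ""          # hand or machine
    administration_time: str = ""
    forms_available: str = ""
    cost: str = ""                    # booklet and answer sheet
    # Technical features
    validity: str = ""
    reliability: str = ""
    norms: str = ""
    # Practical features
    ease_of_administration: str = ""
    ease_of_scoring: str = ""
    ease_of_interpretation: str = ""
    adequacy_of_manual: str = ""
    # General evaluation
    reviewer_comments: str = ""
    local_advantages_and_limitations: str = ""

    def missing_items(self):
        """Names of the entries that are still blank."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


# A partly completed form for an invented test.
form = EvaluationForm(title="Hypothetical Arithmetic Test",
                      publisher="(hypothetical publisher)")
print(len(fields(form)))        # number of entries on the form
print(form.missing_items())     # entries still to be filled in
```

Listing the blank entries serves the second purpose named above, that of making it less likely that pertinent information is overlooked.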


ADMINISTERING STANDARDIZED TESTS 


The procedures for administering group tests of achievement and scholastic 
aptitude are such that the tests can be successfully administered by any con- 
scientious teacher. The main requirement is that the teacher rigorously adhere 
to the testing procedures prescribed in the test manual. To do this, the teacher 
must shift roles from that of teacher and helper to that of an impartial test 
examiner who will not deviate from the test directions. 

Teachers sometimes wonder why it is important to follow the test procedures 
so exactly. What harm is there in helping a pupil if he does not understand the 
question? Why not give the pupils a little more time if they are almost finished? 
Are not some of these directions a bit picayunish, anyway? The answer is that 
a standardized test must be administered under standard conditions if the results 
are to be meaningful. When a standardized test is administered to representa- 
tive groups of pupils for the purpose of establishing norms and for determining 
reliability and validity, it is administered in exact accordance with the procedures 
prescribed in the test manual. Unless we adhere strictly to the same procedures, 
the standard conditions of measurement are broken and we cannot use the test 
norms to interpret our scores. Similarly, the information on reliability and
validity provided in the test manual does not apply unless standard conditions 
of administration are maintained. In short, when the procedures for adminis- 
tering a standardized test are altered, the test becomes no more than a well- 
constructed informal test and the basis for interpreting the test scores is lost. 


Steps in Preparing for Test Administration 


The following steps assume that the regular classroom teacher will administer 
the tests. Although in some schools a test specialist carries out this function, 
there are advantages in using the classroom teacher. In administering the test, 
he becomes more familiar with the test content and he has an opportunity to 
observe the pupils’ responses to the testing situation. With this type of informa- 
tion, he is more apt to interpret and use the test scores intelligently. 

Order and Check the Testing Materials Well in Advance. Test materials
should be ordered so that they will arrive a week or more before the
test date. This allows time for correcting any errors in the order and for the
teachers to study the test materials and obtain practice in administering the
test.

When the test materials arrive, they should be opened and checked to make
certain that the correct form of the test has been sent, that there are a sufficient
number of tests and answer sheets, and that all other ordered materials
have been included. This is also a good time to assemble all other materials
needed for the administration of the test. Extra pencils, a stop watch, scratch 
paper, and other necessary materials can be placed with the tests so that they 
will be available at the time of testing. Most test examiners use a checklist of 
needed materials to provide assurance that nothing is overlooked in preparing 
for the test. 

Select a Suitable Location for Testing. The room which provides the 
most favorable physical and psychological conditions for testing should be used. 
In most instances this is the regular classroom. The lighting and ventilation are 
generally good and pupils are accustomed to taking tests in their own class- 
room. If there are disturbing street noises, or other distracting influences that 
cannot be controlled, it would, of course, be desirable to move the pupils to a 
more quiet classroom. 

At times it may be necessary, or desirable, to combine classes and administer 
the test in an auditorium or some other large room. When this is done, special 
attention must be given to the writing space afforded each pupil. Inadequate
space and an inconvenient place to write can have an adverse effect on test
scores. Also, seating should be arranged so that the pupils can see and hear
the test examiner and the proctors can move freely among the pupils. If possible,
teachers should proctor their own pupils. This provides pupils with a greater
feeling of security, removes any hesitancy they might have about asking questions
concerning mechanics or unclear directions, and enables the teacher to
observe their test behavior.

Make Provisions to Prevent Distractions. There should be no interrup- 
tions or distractions of any kind during testing. These can usually be prevented 
by (1) posting a sign on the door notifying visitors that testing is in progress, 
(2) making certain that all test materials, including extra pencils and erasers, 
are available at the time of testing, and (3) making arrangements to terminate 
temporarily announcements over the loud speaker, the ringing of school bells, 
and other sources of distraction.

There is also a type of distraction that has to be considered. This is the
mental distraction that occurs just before and just after an important athletic
contest, school dance, school election, and the like. These distractions can best
be avoided by proper scheduling of the test. Since major school events are
scheduled well in advance, they can easily be avoided when determining the
date of testing.

Study the Test Materials and Practice Administering the Test. No
matter how simple the administration of a particular test may appear, it is
always wise to read the test manual, take the test yourself, and administer it
to some other person before administering it to a group of pupils. This familiarity
with the test materials makes it possible to anticipate questions and to
know the types of answers that can and cannot be given in clarifying directions.
The understanding of the test obtained with these procedures will, of course,
also increase the effectiveness with which the test results are used. Test scores
become much more meaningful when the test procedures are intimately known.




If more than one teacher is giving the same test, practice in administering the 
test can be obtained by testing each other. When this is done the directions, 
timing, and other aspects of administration should be followed in the same 
precise manner that is to be used in testing pupils. This procedure affords the
greatest likelihood of detecting and correcting weaknesses in reading directions
and keeping time accurately.


Test Administration Procedure 


When proper preparations have been made, the administration of group 
standardized tests is a relatively simple procedure. Basically, it involves the 
following four tasks: (1) motivating the pupils to do their best, (2) adhering 
strictly to directions, (3) keeping time accurately, and (4) recording any significant
events which might influence test scores.

Motivating Pupils. The goal of all standardized testing is to obtain maxi- 
mum performance within the standard conditions set forth in the testing proce- 
dures. We want each person to earn the highest score he is capable of achieving. 
This obviously means that each person must be motivated to put forth his 
best effort. Although some persons will respond to any test as a challenge to 
their ability, others will not work seriously at the task unless they are con- 
vinced that the test results will be beneficial to them. 

In school testing, we can stimulate pupils to put forth their best effort by
convincing them that the test results will be used primarily to help them improve
their learning. We can also make clear to them the value of standardized
test results for understanding themselves better and for planning their future.
These need to be more than hollow promises made at the time of testing, however.
Test results must be used in the school in such a manner that these benefits
to the pupils are clearly evident.

Before administering a particular test, the teacher should indicate to the
pupils the purpose of the test and the specific uses to be made of the results.
At this time, the advantages of obtaining a score which represents the pupils’
best efforts should be emphasized, but care should be taken not to make the
pupils overly anxious. Verbal reassurance that the size of the score is not as
important as the fact that it represents the pupil’s best effort is usually helpful. The
judicious use of humor can also offset test anxiety to a degree. The most effective
antidote, however, is a positive attitude toward test results. When the pupils are
convinced that valid test scores are beneficial to their own welfare, both test
anxiety and motivation tend to become problems of minor concern.

Adhering Strictly to Directions. The importance of following the directions
given in the test manual cannot be overemphasized. Unless the test is
administered in exact accordance with the standard directions, the test results
will contain an indeterminate amount of error and thereby be uninterpretable
in terms of the test norms provided.

The printed directions should be read word for word in a loud, clear voice.
They should never be paraphrased, recited from memory, or modified in any
way. The oral reading of directions will usually be vastly improved if the
directions have been mastered beforehand and they have been read several times
in practice administrations.

After the directions have been read, and during the testing period, some 
pupils are likely to ask questions. It is usually permissible to clarify directions 
and answer questions concerning mechanics (for example, how to record the 
answer), but the test manual must be your guide in answering pupils’ questions. 
If the manual states that the pupils should be referred back to the directions, 
for example, this should be followed exactly. In some scholastic aptitude tests, 
the ability to follow directions is a part of the test and to clarify the directions 
is to give unfair aid to the pupils. Where it is permissible to clarify directions, 
care must be taken not to change or modify the directions in any way during 
the explanation. 

Teachers find it extremely difficult to refrain from helping pupils who are 
having difficulty in answering items on a standardized test. When questioned 
about a particular test item, there is a considerable temptation to say: “You 
remember, we discussed that last week,” or to give some similar hints to the 
pupils. This, of course, merely distorts the results. When asked about a par- 
ticular test item, during testing, the teacher should quietly tell the pupil: “I’m 
sorry but I cannot help you. Do the best you can.” 

Keeping Time Accurately. If the test contains a series of short subtests 
which must be timed separately, it is desirable to use a stop watch when giving 
the test. For most other purposes, a watch with a second hand is satisfactory. 
To insure accurate timing, a written record of the starting and stopping times 
should be made. This should be in terms of the exact hour, minute, and second 
as follows: 

                  Hour   Minute   Second 
Starting time       2      10       0 
Time allowed               12 
Stopping time       2      22       0 

As with the written directions, the time limits must be adhered to rigidly. Any 
deviation from the time limits given in the test manual will violate the standard 
conditions of administration and invalidate the test results. 
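
The arithmetic in the timing record above can be checked mechanically. The following is a minimal sketch only; the function name `stopping_time` and the text format of the times are illustrative assumptions, not part of any test manual.

```python
from datetime import datetime, timedelta


def stopping_time(start: str, minutes_allowed: int, seconds_allowed: int = 0) -> str:
    """Return the stopping time (HH:MM:SS) for a timed test.

    `start` is the recorded starting time written as hour:minute:second.
    """
    begun = datetime.strptime(start, "%H:%M:%S")
    ended = begun + timedelta(minutes=minutes_allowed, seconds=seconds_allowed)
    return ended.strftime("%H:%M:%S")


# The example from the text: start at 2:10:00 with 12 minutes allowed.
print(stopping_time("2:10:00", 12))  # prints 02:22:00
```

Writing the stopping time down before the test begins, rather than computing it while the pupils are working, keeps the examiner's attention on the class.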

Recording Significant Events. During testing, the pupils should be care- 
fully observed and a record made of any unusual behavior or event which 
might have an influence on the test scores. If, for example, a pupil appears overly 
tense and anxious, sits staring out of the window for a time, or seems to be 
marking his answers in a random manner without reading the questions, a 
description of the behavior should be recorded. Similarly, if there are inter- 
ruptions during the testing (despite your careful planning), a record should be 
made of the type and length of the interruption and in what manner, if any, it 
altered the administration of the test. 

A record of unusual pupil behavior and significant events provides valuable 
information for determining whether test scores are representative of the pupils’ 
best efforts and whether standard conditions have been maintained during test- 
ing. Questionable test scores should, of course, be rechecked—by administering 
a second form of the test. 


SCORING STANDARDIZED TESTS 


The aim in scoring standardized tests is to obtain accurate results as rapidly 
and economically as possible. Where a relatively small number of pupils are 
being tested, this aim can usually be satisfactorily achieved by hand-scoring 
methods. For larger groups, however, and especially for a school-wide testing 
program, the use of a scoring machine is generally more effective. 


Hand Scoring 


The scoring of standardized tests by hand is a routine clerical task. The pro- 
cedure consists mainly of comparing the pupils’ answers on the test booklet, 
or separate answer sheet, with the correct answers listed on a scoring key and 
counting the number correct. In some cases, such as where a correction formula 
is used, simple arithmetic skills are also involved. Despite the simplicity of 
hand-scoring procedures, the task is fraught with possibilities for error. Un- 
trained scorers make frequent errors in counting answers, following instruc- 
tions, using scoring guides, and the like. In a study of the scoring accuracy of 
51 third-grade and fourth-grade teachers, errors were found in 28 per cent of 
the standardized tests they had corrected.5 Such results are not unusual. 
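
The hand-scoring procedure described above (compare each answer with the key, count the number correct, and apply a correction formula where one is called for) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the correction shown is the common correction for guessing, rights minus wrongs divided by one less than the number of choices. The manual for the particular test governs which formula, if any, applies.

```python
def count_right_wrong(answers, key):
    """Compare a pupil's answers with the scoring key.

    Omitted items are recorded as None and count as neither right nor wrong.
    """
    right = sum(1 for a, k in zip(answers, key) if a is not None and a == k)
    wrong = sum(1 for a, k in zip(answers, key) if a is not None and a != k)
    return right, wrong


def corrected_score(right, wrong, choices):
    """Correction for guessing (assumed formula): R - W / (choices - 1)."""
    return right - wrong / (choices - 1)


key = ["A", "B", "C", "D", "A"]
answers = ["A", "B", "D", None, "A"]   # one wrong answer, one omission
right, wrong = count_right_wrong(answers, key)
print(right, wrong, corrected_score(right, wrong, choices=5))  # prints 3 1 2.75
```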


Scoring errors can be reduced to a minimum by taking the following pre- 
cautions: 


2. During the regular scoring of test papers, have each paper rescored by a 
second person. If this is impractical, rescore a representative sample of the 
papers, such as every fifth paper, to determine if further rescoring is needed. 

3. Have each person initial the papers he scores. Where the scorer can be 
identified, he is more apt to make a special effort to avoid errors. 

4. Rescore any paper which has scores that appear questionable, such as 
exceptionally high or low scores, scores which deviate widely from the teacher's 
judgment, and the like. 
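
The sampling precaution listed above (rescoring every fifth paper and checking whether more rescoring is needed) can be sketched in a few lines. The function names and the sample scores are hypothetical and serve only to illustrate the idea.

```python
def rescoring_sample(papers, step=5):
    """Return every `step`-th paper (the 5th, 10th, ...) for a second scoring."""
    return papers[step - 1::step]


def discrepancies(first_scores, second_scores):
    """List the papers whose second scoring disagrees with the first."""
    return [paper for paper, score in second_scores.items()
            if first_scores[paper] != score]


papers = list(range(1, 21))                 # papers numbered 1 through 20
sample = rescoring_sample(papers)           # papers 5, 10, 15, and 20
first = {5: 30, 10: 41, 15: 28, 20: 35}     # scores from the first scorer
second = {5: 30, 10: 40, 15: 28, 20: 35}    # scores from the second scorer
print(sample, discrepancies(first, second))  # prints [5, 10, 15, 20] [10]
```

Any disagreement found in the sample is a signal that the remaining papers should be rescored as well.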


Accuracy cannot be overemphasized in the scoring of standardized tests. Even 
small scoring errors have a detrimental influence on the reliability and validity 
of the test results. Large errors may lead to educational decisions which are 
injurious to individuals or groups. 


5 B. N. Phillips and G. R. Weathers, “Analysis of Errors Made in Scoring Standardized 
Tests,” Educational and Psychological Measurement, 18, 563-567, Autumn, 1958. 

Should Teachers Do the Scoring? The majority of schools still rely on 
the classroom teacher to score the standardized tests administered to his pupils. 
This is usually defended on the following grounds: (1) it is fast, accurate, and 
economical, (2) teachers will obtain insight into the types of errors most fre- 
quently made by their pupils, and (3) it stimulates more active interest in the 
use of the test results. These reasons are seldom valid, however. In the first 
place, trained clerks can generally score standardized tests faster, more accu- 
rately, and at lower cost than classroom teachers. It is false economy to have 
professionally trained personnel perform routine clerical tasks. Secondly, scor- 
ing objective answers on standardized tests is a mechanical procedure which 
provides no insight into the nature of the test items missed by pupils. Finally, 
teachers view the scoring of standardized tests as an onerous task. Whether par- 
ticipation in such a distasteful activity contributes to greater interest in the use 
of the test results is highly questionable. It is more apt to result in unfavorable 
attitudes toward the entire testing program. 

Where hand scoring is necessary, the most desirable procedure is to select 
and train clerical workers for the task. If they are unavailable and teachers 
must be used for the scoring, it is imperative that they be given training and 
supervised practice. It should not be assumed that teachers are efficient at rou- 
tine clerical tasks merely because they are good teachers. 


Machine Scoring 


The most effective method of relieving teachers of the burdensome task of 
scoring standardized tests is to use machine scoring. A number of electrical 
and electronic scoring machines are available for this purpose and their services 
can be obtained directly from most test publishers. 

The oldest and best known of the scoring machines is the International Test 
Scoring Machine developed by the International Business Machines Corporation. 
This machine requires a special answer sheet like the one presented in Figure 
13.1. Answers are recorded on the sheet by blackening the answer space with a 
special soft pencil. Answer sheets are scored by placing them in the machine 
one at a time, pressing a lever, and reading the score on a meter. The machine 
works on the principle that an electric current is conducted by the graphite in 
the pencil marks. 

A newer IBM machine, the Electronic Scoring Punch 9002, makes it possible 
to use standard IBM cards as answer forms. An example of such a card is 
shown in Figure 13.1. The use of these answer cards not only provides for 
faster scoring, but they also make it possible to tabulate and record the scores 
mechanically. In addition, the answer cards are easily stored and in readily 
usable form for research purposes. They can be used in regular IBM equipment 
without the usual card punching operation. 

Electronic test processing equipment developed at the University of Iowa 
combines a test scoring machine with an electronic computer and a fast output 
printer. This equipment automatically scores the answer sheets, converts the 
raw scores to various types of derived scores, and provides printed reports of 
the results. Answer sheets are processed at a basic rate of 6,000 an hour. The 
use of this equipment is controlled by a nonprofit corporation, the Measurement 
Research Center, Iowa City, Iowa. The services of MRC are available to schools 
through test publishers. 


Figure 13.1. Machine-scoring answer forms. (Courtesy of International Business Machines 
Corporation. Cal-Card © 1961 by California Test Bureau. Used by permission.) 


The main advantage of machine scoring over hand scoring is that it is more 
accurate. In addition, machine scoring is relatively inexpensive. Test scores can 
be obtained from test publishers at a cost of several cents each, and additional 
statistical and reporting services are usually available at a similarly low cost. 
The major disadvantage is that, despite the speed of the machine, it frequently 
takes several weeks or more to obtain the results from test publishers. This is not 
a serious limitation, however, and one that can be offset by careful scheduling 
of the testing dates. 


REPORTING AND RECORDING TEST RESULTS 


When standardized tests have been scored, a copy of the results should be 
sent to the teachers concerned as rapidly as possible. Unless test results are 
made available to teachers quickly, much of their instructional value is lost. 
This is an especially significant factor where fall testing is used and the results 
are needed for instructional planning. 

In addition to reporting the test results directly to teachers, test scores should 
be entered on the cumulative record form of each pupil. This should include 
all of the necessary information for interpreting and using the test results effec- 
tively. As a minimum, the following information should be indicated: (1) date 
the test was given, (2) grade in which the test was given, (3) test title and 
form used, (4) total and part scores, expressed in appropriate units, and (5) 
nature of norm group, where percentile or standard scores are used. 

For effective use of test results and other evaluative data, it is essential that 
the pupils’ cumulative records be kept up to date and easily accessible. Test 
scores soon gather dust where school policy, or the record system itself, pro- 
vides barriers to their use. All teachers should have free and ready access to 
the pupils’ records and should be encouraged to use them as needed. 
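
Where cumulative records are kept in a data-processing system, the five items of minimum information listed above might be represented as one record entry. The sketch below is illustrative only: the class name, field names, and the sample test title are assumptions made for the example, not a prescribed record form.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional

# Score units that cannot be interpreted without knowing the norm group.
NORM_REFERENCED_UNITS = {"percentile", "standard score"}


@dataclass
class CumulativeRecordEntry:
    date_given: date                   # (1) date the test was given
    grade_given: int                   # (2) grade in which the test was given
    test_title: str                    # (3) test title ...
    form_used: str                     #     ... and form used
    scores: Dict[str, float]           # (4) total and part scores
    score_units: str                   #     units in which scores are expressed
    norm_group: Optional[str] = None   # (5) nature of norm group

    def __post_init__(self):
        # Percentile and standard scores require a description of the norm group.
        if self.score_units in NORM_REFERENCED_UNITS and not self.norm_group:
            raise ValueError("norm group is required for percentile or standard scores")


entry = CumulativeRecordEntry(
    date_given=date(1965, 10, 4),
    grade_given=4,
    test_title="Hypothetical Achievement Battery",
    form_used="A",
    scores={"total": 62.0, "reading": 70.0, "arithmetic": 55.0},
    score_units="percentile",
    norm_group="national fourth-grade norms",
)
```

Rejecting an entry that lacks a norm group is one way of making sure a recorded percentile can still be interpreted years later.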


USING THE RESULTS OF STANDARDIZED TESTS 


The administration and scoring of group standardized tests are rather me- 
chanical procedures requiring rigid adherence to fixed rules and directions. 
Interpretation and use of test results, however, cannot be reduced to any such 
rigid set of rules. These functions require sound professional judgment based on 
broad training and experience. In the following chapter, we shall present the 
technical considerations in interpreting test scores and norms. Here the dis- 
cussion will be limited to some of the possible uses and misuses of standardized 
test results. 


Possible Uses of Standardized Tests 


If they are carefully selected and used with discretion, standardized tests of 
achievement and scholastic aptitude can play an important role in the evaluation 
program of the school. As noted earlier, they should be selected in accordance 
with the purposes for which the results are to be used. Effective use requires 
the recognition that standardized tests provide only partial information in a 
comprehensive evaluation program. This information must be supplemented 
by the results of classroom tests, teacher observations, performance evaluations, 
and other types of evaluative data. 

All too frequently, standardized tests are permitted to dominate and modify 
the educational program of the school undesirably. This is most apt to occur 
where little attention is paid to objectives and where there is a feeling that 
standardized tests must, somehow, measure the important outcomes of educa- 
tion. The result, of course, is that the school curriculum is shaped by test 
publishers rather than by local school personnel. Proper use of standardized 
tests demands that we view them as tools which contribute to the achievement 
of our educational objectives by providing us with the information we desire 
concerning our pupils. When viewed in this manner, they can contribute to more 
effective educational decisions in a number of areas. 

Curriculum Planning. Standardized tests of achievement and scholastic 
aptitude can aid curriculum planning in a particular school or course in the 
following ways: 

1. Identifying the level and range of ability among pupils. Among other 
things, curriculum plans must take into account the learning potential of the 
pupils and their present levels of achievement. Standardized tests provide 
objective evidence on both of these points. 

2. Identifying areas of instruction needing greater emphasis. If the standard- 
ized tests have been carefully selected in terms of the objectives of the school, 


however, the results of 
of achievement, 


3. Evaluating experimental programs of instruction. Where changes in cur- 
riculum content or instructional procedures are being introduced on an ex- 
perimental basis, standardized tests of scholastic aptitude can be used to match 
groups on learning potential and standardized tests of achievement can con- 
tribute to a more objective comparison of the new and old methods. The lim- 
ited objectives measured by standardized tests must, of course, be taken into 
account in interpreting the results. 


4. Clarifying and selectin 
the clarification and selecti 


of selecting standardized tests forces us to identify and state our objectives as 
clearly as possible. Second, an examination of the items in standardized tests 
clarifies to us how different objectives function in testing. As we go back and 
forth (during test selection) between our objectives and the items in the tests 
under consideration, our notion of what a particular objective looks like in 
operational terms becomes increasingly clear. Third, results of standardized tests 
provide evidence which helps in selecting objectives at a particular grade level 
or in a particular course. If fall test results indicate that pupils in the pa 
ills, for example, the teacher may want to include 
en though he had not originally planned to devote 
ls. Similarly, standardized test results might indicate 


is unnecessary because it has been satisfactorily 
level. 


em too heavily. Even the most comprehensive battery of achievement tests 
measures only a portion of the objectives of a well-balanced curriculum. 

Sectioning and Grouping Pupils. One of the most common uses of 
standardized test results is that of organizing groups which are similar in 
terms of learning ability. This ability grouping is frequently used in large 
schools when it is necessary to form several classes at each grade level or sev- 
eral sections of the same course. It is also used widely by teachers in forming 
instructional groups within classes. Elementary teachers, for example, com- 
monly form reading groups and arithmetic groups on the basis of pupil ability. 

Forming school or classroom groups according to ability is frequently re- 
ferred to as homogeneous grouping. This is somewhat of a misnomer, however, 
since pupils grouped in terms of one type of ability are apt to vary considerably 
in other abilities, in interests, in attitudes, and in other characteristics significant 
from a learning standpoint. Ability grouping is a much more suitable term 
but even here there is the danger of assuming that the pupils in a given group 
will be identical in ability. All we can hope for is that the range of ability 
in the group will be smaller than it would be without such grouping. 

Although ability grouping is practiced widely, there is still considerable 
criticism of its use. Two of the most common objections are: (1) it does not 
adequately provide for individual differences, and (2) a stigma is attached to 
pupils in the slowest groups. The first objection is a valid one which has helped 
clarify the role of ability grouping. It is becoming increasingly clear that ability 
grouping, by itself, is inadequate in coping with individual differences. It merely 
reduces the range of ability in a group so that individualized instruction can 
be applied more effectively. The second objection describes a real danger that 
is difficult to avoid. Separate grouping in each subject has helped. Where a 
pupil can be in a slow group in English, an average group in social studies and 
science, and an accelerated group in mathematics, there is less apt to be a 
stigma attached to those in the slow group. Flexible grouping which permits 
pupils to shift from one group to another, as performance improves, also helps 
in counteracting undesirable attitudes toward pupils in the slowest groups. 

Ability groups can be formed on the basis of scholastic aptitude or achieve- 
ment test scores. A more desirable practice is to use both. Borderline cases 
are more effectively placed when both types of test scores are used. If a pupil's 
achievement test score is near the cutoff point for the accelerated group, for 
example, he could be placed in the regular or accelerated group on the basis 
of how his scholastic aptitude test score compared with those to be placed in 
each group. Using both types of test results will generally result in groups that 
are more alike in learning ability. In forming instructional groups within the 
classroom, standardized test results are generally used for preliminary grouping 
only. Later adjustments in grouping depend more heavily on the teachers’ day- 
to-day evaluations of the pupils’ learning progress. 
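
The handling of borderline cases described above can be made concrete with a short sketch. Everything specific here is an assumption for illustration: the function name, the width of the borderline band (`margin`), and the use of each group's average aptitude score as the basis of comparison. The text itself specifies only that the aptitude score decides a case near the cutoff.

```python
def place_pupil(achievement, aptitude, cutoff, margin, group_aptitude_means):
    """Place a pupil in the 'regular' or 'accelerated' group.

    Pupils clearly above or below the achievement cutoff are placed on
    achievement alone. A pupil within `margin` points of the cutoff is a
    borderline case and goes to the group whose average scholastic
    aptitude score is closest to his own.
    """
    if achievement >= cutoff + margin:
        return "accelerated"
    if achievement <= cutoff - margin:
        return "regular"
    return min(group_aptitude_means,
               key=lambda group: abs(group_aptitude_means[group] - aptitude))


means = {"regular": 100, "accelerated": 120}   # hypothetical group averages
print(place_pupil(71, 118, cutoff=70, margin=3, group_aptitude_means=means))
# prints accelerated
print(place_pupil(71, 104, cutoff=70, margin=3, group_aptitude_means=means))
# prints regular
```

A placement made this way is preliminary; as the text notes, later adjustments rest on the teacher's day-to-day evaluations.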

Individualizing Instruction. Despite the method of sectioning classes or 
grouping pupils within classes, individual differences in aptitude and achieve- 
ment will still exist among the pupils in any given group. Thus it is necessary 
to study the strengths and weaknesses of each pupil in class so that instruction 
can be adapted, as much as possible, to their individual learning needs. For 
this purpose (1) scholastic aptitude tests provide clues concerning learning 
potential, (2) reading tests indicate the difficulty of material the pupil can read 
with comprehension, (3) achievement tests point out general areas of strength 
and weakness, and (4) diagnostic tests pinpoint the particular errors of learn- 
ing that are handicapping the pupil. 

The use of tests to diagnose in order to remedy learning difficulties is one 
of the most common ways of using test results to individualize instruction. At 
the elementary school level, a large fraction of teaching time is devoted to 
helping individual pupils identify and correct their errors in reading, arith- 
metic, spelling, and writing. At the high school level, considerable time is also 
devoted to the specific learning difficulties of individual pupils, both in general 
skill areas, like that of reading, and in the understanding of specific subject 
matter content. Published diagnostic tests are especially useful in diagnosis 
because they provide a systematic approach to the identification of learning 
errors. Since these tests are limited almost entirely to the areas of reading and 
arithmetic, however, it is frequently necessary to use general achievement tests 
for diagnostic purposes. When this is done, an analysis of a pupil’s responses 
to the individual test items will provide clues to his learning difficulties. 

Identifying Pupils Needing Special Help. Some pupils deviate so markedly 
from the normal group of pupils at their grade or age level that special treat- 
ment, beyond that which can be provided by the classroom teacher, is needed. 
The extremely gifted, the mentally retarded, the very poor reader, the emo- 
tionally disturbed, and similar exceptional children fall into this category. For 
this type of pupil, standardized testing serves as a screening process; it identi- 
fies those pupils requiring further study so that special provisions can be made 
for meeting their exceptional needs. 

Evaluating the Educational Development of Pupils. Standardized tests 
are especially useful in measuring the learning progress of pupils over extended 
periods of time. Equivalent forms of the same test can be used in the fall and 
spring to measure progress during the school year. Where achievement test 
batteries are administered annually, a long-range picture of the pupils’ educa- 
tional development can be obtained. Standardized tests are better suited to this 
type of measurement than teacher-made tests because the measurements are more 
comparable from one time to the next. 

In using standardized tests as one basis for determining the educational de- 
velopment of pupils, care must be taken not to overgeneralize from the results. 
Tests which provide somewhat comparable measures of general educational 
development throughout the school years must, of necessity, be confined to those 
learnings which are in the process of continuous development and which are 
common to diverse curricula. In the main, this means the basic skills and those 
critical abilities used in the interpretation and application of ideas which cut 
across subject-matter lines. Although these are extremely significant educational 
outcomes, they provide only a partial picture of total educational development. 
Knowledge and understanding of specific subject-matter content, skills which 
are unique to each subject field, attitudes and appreciations, and similar learn- 
ing outcomes that cannot be measured by standardized tests of educational 
development, are equally important. 

Helping Pupils Make Educational and Vocational Choices. At the 
high school level, standardized test results can contribute to more intelligent 
educational and vocational decisions. In deciding which curriculum to pursue, 
which specific courses to take, whether to plan for college, or which occupations 
to consider, the pupil can be aided greatly by knowledge of his aptitudes and 
his strengths and weaknesses in achievement. Standardized tests are especially 
useful in educational and vocational planning because they indicate to the pupil 
how he compares with persons beyond the local school situation. This is im- 
portant because these are the persons with whom he will be competing when 
he leaves high school, whether he goes to college or directly into an occupation. 

Supplementary Uses. In addition to the above uses of standardized tests, 
all of which are directly concerned with improving the instruction and guidance 
of pupils, there are a number of supplementary uses to which they can be put. 
These include such things as: (1) appraising the general effectiveness of the 
school program in developing basic skills, (2) identifying areas in the educa- 
tional program where supervisory aid and in-service training can be used most 
effectively, (3) providing evidence for interpreting the school program to the 
public, and (4) providing information for reports to other schools, colleges, 
and prospective employers. When standardized test results are presented to 
individuals and groups outside the school, it should be emphasized that these 
tests measure only a limited number of the objectives of the school program. 


Misuses of Standardized Tests 

As noted earlier, standardized tests can be misused in any of the areas dis- 
cussed above, if: (1) there is inadequate attention to the educational objectives 
being measured, (2) there is a failure to recognize the limited role the tests 
play in the total evaluation program, or (3) there is unquestioning faith in the 
test results. These factors contribute to the misapplication and misinterpretation 
of standardized test results in any situation. In addition, there are two specific 
misuses that warrant special attention. 

Assigning Course Grades. Some teachers use standardized test scores as 
a basis for assigning course grades. This is undesirable for at least two reasons: 
(1) standardized tests are seldom closely related to the instructional objectives 
of a particular course, and (2) they measure only a portion of the 
learning outcomes emphasized in instruction. The practice of using standardized 
tests for grading purposes tends to place too much emphasis on a limited num- 
ber of ill-fitting objectives. In addition to the unfairness to the pupil, this 
encourages both teachers and pupils to neglect those objectives which are not 
measured by the tests. 

In borderline cases, especially where promotion or retention is being decided, 
standardized test results can serve a valuable supplementary function. Knowing 
a pupil’s scholastic aptitude and his general level of educational development 
contributes to a more intelligent decision concerning his best grade placement. 
Except for such special uses, however, standardized tests should play a minor 
role in determining course grades. 

Evaluating Teaching Effectiveness. In some schools a teacher’s effective- 
ness is judged by the scores his pupils make on a standardized test. This is an 
extremely unfair practice since there are so many factors, other than teaching 
effectiveness, which influence test scores. Many of these, such as the level of 
ability of the class, cultural background of the pupils, previous educational 
experiences of the pupils, and the relative difficulty of learning different 
materials, cannot be controlled or equated with sufficient accuracy to justify 
inferring that the results derived solely from the teacher’s efforts. Even if such 
factors could be controlled, standardized tests would serve as a poor criterion 
of teaching success because of the two factors mentioned earlier: they are seldom 
closely related to the instructional objectives of particular courses, and they 
measure only a portion of the learning outcomes teachers strive for in their 
instruction. 


THE SCHOOL-WIDE TESTING PROGRAM 


The contribution of standardized testing to the total educational program of 
the school is substantially increased when the tests are selected and used in 
conjunction with a testing program organized on a school-wide basis. There 
are several advantages to a carefully planned, systematic program. First, plan- 
ning the program requires reviewing the instructional goals of the entire school 
to determine the need for evaluation instruments in general and for standardized 
tests in particular. This process clarifies the role of standardized testing in the 
instructional and evaluation programs of the school and leads to more effective 
interpretation and use of the results. Second, a planned program provides for 
selection of tests that serve a variety of uses. Although as teachers we are 
primarily interested in using standardized tests to improve pupil learning, care- 
fully selected tests can also serve administrative and guidance functions in the 
school. An organized program increases the likelihood that these needs will also 
be considered during test selection. Third, a planned program provides results 
that are comparable from one year to the next. This obviously gives us a better 
picture of the pupils’ educational development than that obtained by sporadic 
testing with a variety of different tests. Fourth, a planned program makes it 
possible to accumulate test results and other evaluative data systematically. 
A systematic record contributes to more meaningful and rational interpretation 
of test scores since it encourages examination of results in relation to previous 
test scores and other pertinent accumulated data. 


A Minimum Program 


Since the testing program in any particular school must be developed in terms 
of the instructional goals of the school and the specific uses to be made of the 
results, it is impossible to present a detailed testing program that would have 
general application. Each program should be tailor-made to the needs of the 
school; this requires cooperative study and planning by teachers, guidance per- 
sonnel, and administrators. About the only contribution we can make to this 
undertaking is to suggest tests and procedures which should be considered in 
planning a minimum program. 

Scholastic Aptitude Testing. Periodic measures of scholastic aptitude play 
ogram. The first test is generally given at the 
end of the kindergarten year or at the beginning of the first grade. This should 
be followed up by a test every several years. Selection of the specific grades in 
which scholastic aptitude tests are to be given must take into account (1) the 
way the school is organized, (2) whether the tests are to be given in the fall 
or spring, and (3) the use to be made of the results. Key points in scheduling the 
tests are those where the pupil moves from one educational level to another. 
Thus, in a school organized on a 6-3-3 plan, where spring testing is used, 


scholastic aptitude tests might be given in the kindergarten, third, sixth, ninth, 
and eleventh grades. If fall testing is used, a sequence like the first, fourth, 
de similar results. 


seventh, tenth, and twelfth grades would provi s ; 
For some purposes it might be desirable to use a reading-readiness test at 


the first-grade level and a differential aptitude test at the tenth-grade level. 
Where this is done, the first-grade and tenth-grade scholastic aptitude tests 
could be dropped from the schedule, since both of the above tests provide satis- 
factory estimates of scholastic aptitude. Decisions like this must, of course, 
depend on the use to be made of the results. There is little advantage in using 
a reading-readiness test unless reading instruction is individualized, or in using 
a differential aptitude test unless arrangements are made for interpreting and 


using the results in the guidance of pupils. 


Group tests of scholastic aptitude are satisfactory for the minimum program,
but low scores and those which deviate considerably from teachers’ judgments
should be checked by an individual test. If no one in the school is qualified to
administer individual tests, doubtful scores can be checked by administering
another form of the same group test. When used for this purpose, the group
test should be administered to a relatively small number of pupils at one time
and the pupils observed during the testing.

Achievement tests also play a significant role in any testing program. As a bare
minimum, standardized achievement tests should be administered at the same time
that scholastic aptitude tests are administered, starting at about the third- or
fourth-grade level. This will provide evidence of both aptitude and achievement
at each of the transitional points in the pupils’ school life. There is an
advantage in administering scholastic aptitude tests and achievement tests at
the same time of the year. Through the use of comparable norms, it is possible
to compare the test results and determine the extent to which pupils are
achieving up to their potential.

A typical minimum testing program which takes into account our suggestions
for testing at transitional points is presented in Table 13.2. It should be noted
that it is based on a school using a 6-3-3 organizational plan and fall testing.


268 Using Standardized Tests 


A school organized differently would, of course, have different transitional
points and the program would be modified accordingly. As noted earlier, spring
testing would also cause a shift in the schedule. In general, all tests would be
given one year earlier.


Table 13.2

A TYPICAL MINIMUM TESTING PROGRAM *

                          Type of Test

         Reading     Scholastic   Differential   Achievement
Grade    Readiness   Aptitude     Aptitude       Battery

  1         X†          X
  2
  3
  4                     X                            X
  5
  6
  7                     X                            X
  8
  9
 10                     X            X†              X
 11
 12                     X                            X

* For school using 6-3-3 plan and fall testing.
† Could substitute for scholastic aptitude tests at that grade level.
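
The schedule in Table 13.2, and the one-year shift that spring testing would cause, can be written out as a small sketch. The dictionary below is one reading of the table; the name `FALL_SCHEDULE`, the function `spring_schedule`, and the use of grade 0 to stand for kindergarten are assumptions made for illustration only.

```python
# Fall schedule as read from Table 13.2 (6-3-3 plan): grade -> tests given.
# The assignment of tests to grades is a reading of the table, not a rule.
FALL_SCHEDULE = {
    1: ["reading readiness", "scholastic aptitude"],
    4: ["scholastic aptitude", "achievement battery"],
    7: ["scholastic aptitude", "achievement battery"],
    10: ["scholastic aptitude", "differential aptitude", "achievement battery"],
    12: ["scholastic aptitude", "achievement battery"],
}


def spring_schedule(fall_schedule):
    """Shift every testing point one grade earlier (grade 0 = kindergarten)."""
    return {grade - 1: list(tests) for grade, tests in fall_schedule.items()}


for grade, tests in sorted(spring_schedule(FALL_SCHEDULE).items()):
    print(grade, ", ".join(tests))
```

Under this reading, a school that tests in the spring would give the tenth-grade tests at the end of the ninth grade, just before the transition to senior high school.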


Achievement test batteries used in a minimum testing program should generally
be limited to the basic skills at the elementary school level and to measures
of general educational development at the high school level. This provides
a standard measure of those outcomes which form the common core of a
curriculum. More specific instructional outcomes in such content areas as
science, social studies, and literature can usually be measured more effectively
with locally constructed achievement tests.

Where possible, it is desirable to go beyond the minimum program and
administer achievement tests annually. This provides all teachers with an
up-to-date measure of each pupil’s strengths and weaknesses. In addition to
improving the immediate instructional value of the results, annual testing
provides a more comprehensive and meaningful picture of the pupils’ general
educational progress.


Diagnostic Testing. It is not expected that diagnostic tests will be routinely
and regularly administered as are other tests in the testing program. However,
a good minimum program will provide for the use of diagnostic tests by
individual teachers, as needed. Where general achievement tests uncover
weaknesses in the learning of basic skills, for example, a teacher should feel
free to request the use of a diagnostic test with those pupils having the
learning difficulty. During the planning of a minimum program, this type of
testing must be considered in order to assure that the necessary funds are
available and that diagnostic testing is included in the in-service training
program for teachers.

Self-Report Techniques. If adequate guidance facilities are available in
the school, so that the results can be interpreted and used effectively, interest
and adjustment inventories might be added to the minimum program. A general
interest inventory administered once or twice at the high school level is
useful in educational and vocational planning. An adjustment inventory, or
problem checklist, administered on a school-wide basis at the high school level,
can serve as a screening device for locating pupils with adjustment problems
serious enough to require further study and counseling.

If used, interest and adjustment measures should be scheduled at the grade
level where the results are most useful in pupil guidance. Since the tenth and
twelfth grades are critical points in educational and vocational planning, the
interest inventory could be administered at the beginning of these grade levels.
The adjustment inventory, or problem checklist, is probably most helpful if
administered to pupils early in their high school careers. An important
consideration in scheduling the measure of adjustment is availability of
counseling time. Its use as a mental health screening device assumes that
counselors have sufficient time to provide the necessary follow-up and
counseling services.


Selecting the Time of Year for Testing

In addition to the other considerations in planning a school testing program,
a decision must be made concerning the best time of the year to give the tests.
Spring testing is most frequently favored where the emphasis is on evaluating
the effectiveness of the school program and where the results are to be used
in sectioning classes for the following year. For most instructional purposes,
however, spring testing leaves much to be desired. The results are obtained too
late in the school year for classroom planning and too late to correct learning
weaknesses revealed by the tests. Fall testing provides teachers with
information for planning the year’s work, for grouping within the class, and
for guiding and directing the work of individual pupils. Fall testing also avoids
the two most common misuses of standardized testing: grading pupils and
evaluating teachers.


A Note of Caution

The school-wide program of standardized testing constitutes only a small
part of the evaluation program of the school. To measure adequately the
diverse learning outcomes and to serve the many uses for which evaluation
information is needed in the school, a variety of evaluation methods is required.
Teacher-made achievement tests, anecdotal records, performance ratings,
sociometric techniques, and similar informal methods of appraisal both supplement
and complement standardized test results. Ideally, the school-wide testing
program should be planned in conjunction with the total evaluation program of
the school so that the role of standardized testing is viewed in proper perspective.


SUMMARY 


Teachers should be intimately familiar with the procedures for locating,
selecting, administering, scoring, and using standardized tests. This enables them
to participate in the school testing program more effectively, to select tests for
particular uses more wisely, and to apply test results to instructional problems
more intelligently.

The two basic guides for locating information about specific tests are Buros’
Tests in Print and his Mental Measurements Yearbooks. Supplementary
information may be obtained from test publishers’ catalogues, the Review of
Educational Research, and various professional journals containing articles on
testing.

To provide greater assurance that the most appropriate tests are selected, a
systematic selection procedure should be followed. This includes: (1) defining
the specific needs for evaluation data, (2) appraising the role of standardized
testing in relation to other evaluation procedures and to the practical limitations
of the school situation, (3) locating suitable tests through Buros’ guides and
test publishers’ catalogues, (4) obtaining specimen sets of those tests which
seem most appropriate, (5) reviewing the test materials in light of their
intended uses, and (6) summarizing the data and making a final selection. The
summary of data concerning each test under consideration is simplified if a
standard evaluation form is used during the information gathering period of
the selection process.

Administration of standardized tests involves careful preparation beforehand
and strict adherence to standard procedures during testing. In preparing for
testing, the materials should be ordered and checked well in advance, a suitable
location for testing should be selected, provisions should be made to prevent
distractions, and practice should be obtained in administering the test. During
administration, the directions and time limits must be rigidly followed. Within
the limits of the standard conditions, the pupils should be motivated to do their
best. Also, a record should be made of any event during the testing period
which might have an influence on the test scores. Questionable scores should
be checked by administering a second form of the test.

Scoring of standardized tests is a mechanical procedure which places a premium
on accuracy, speed, and economy. Hand scoring is satisfactory where a
relatively small number of pupils is involved. Where hand scoring is necessary
or desired, trained clerks should be used rather than teachers. When a large 
number of tests are to be scored, there is no substitute for machine scoring; 
it is accurate, fast, and relatively inexpensive. Machine scoring services can be 
obtained directly from most test publishers. 

When tests have been scored, results should be made immediately available 
to the teachers concerned. They should also be recorded in the pupils’ cumula- 
tive records for future use by all school personnel. 

Standardized test results can serve a number of useful purposes in the school. 
They can aid in curriculum planning, sectioning and grouping of pupils, indi- 
vidualizing instruction, identifying pupils needing special help, evaluating the 




educational development of pupils, educational and vocational planning, and 
appraising and reporting on the effectiveness of the school program. In general 
they should not be used as a basis for assigning course grades or evaluating 
teaching effectiveness. Standardized tests are not closely enough related to the 
goals of particular courses and they measure too limited a sampling of instruc- 
tional objectives to be useful for these purposes. 

A formal school-wide testing program provides for more effective use of
standardized tests through improved planning, better test selection, greater
comparability of test results, and more systematic procedures for recording
results. A minimum program should include measures of scholastic aptitude
and achievement at transitional points in the school life of pupils and
supplementary testing as needed.


SUGGESTIONS FOR FURTHER READING 

Anastasi, Anne. Psychological Testing. 2nd edition, New York: Macmillan, 1961.
Chapter 3: “Uses of Psychological Tests.”

Bauernfeind, R. H. Building a School Testing Program. Boston: Houghton Mifflin,
1963. Chapter 7: “Testing in Grades K-3.” Chapter 8: “Testing for Educational
Development, Grades 3-12.” Chapter 9: “Testing for Mental Ability, Grades 3-12.”
Chapter 10: “Testing for Vocational Aptitudes.” Chapter 13: “Using
Subject-Matter Achievement Tests.” Chapter 14: “Planning the Master Testing
Program.”

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper & Row,
1960. Chapter 3: “Administering Tests.”

Davis, F. B. Educational Measurements and Their Interpretation. Belmont,
California: Wadsworth Publishing Co., 1964. Chapter 3: “Selection and
Administration of Standardized Tests.” Chapter 4: “Test Scoring.”

Ebel, R. L. “Eight Critical Questions About the Use of Tests,” Education, 81,
67-99, 1960.

Findley, W. G. (ed.) The Impact and Improvement of School Testing Programs,
Sixty-second Yearbook of the National Society for the Study of Education,
Part II. Chicago: The University of Chicago Press, 1963. Chapter 1: Findley,
W. G., “Purposes of School Testing Programs and Their Efficient Development.”
Chapter 10: Traxler, A. E., and North, R. D., “The Selection and Use of Tests
in a School Testing Program.”

Test Bulletins

Katz, M. Selecting an Achievement Test: Principles and Procedures. Evaluation
and Advisory Service Series, No. 3, Princeton, New Jersey: Educational Testing
Service, 1961.

Lennon, R. T. Testing in the Secondary School. Test Service Notebook, No. 20,
New York: Harcourt, Brace & World, 1961.


Chapter 14
interpreting
test scores
and norms


Test performance is most meaningful when some basis for comparison is
available. . . . Norms provide clearly defined reference groups for this
purpose. . . . Interpreting test scores with the aid of norms requires (1)
an understanding of the various methods of expressing the scores, and
(2) the ability to judge what the norm group actually represents. . . . In
addition, proper interpretation calls for a frank awareness of the
limitations of all test scores.


Test interpretation would be greatly simplified if we could express test scores
on scales like those used in physical measurement. We know, for example, that
5 ft. means the same height whether we are talking about the height of a boy
or a picket fence; that a 200-lb. football player weighs exactly twice as much
as a 100-lb. cheerleader; and that 8 min. is exactly one third as long as 24 min.
whether we are timing a standardized test or a basketball game. This ability to
compare measurements from one situation to another, and to speak in terms of
“twice as much as” or “one third as long as,” is made possible by the fact
that these physical measures are based on scales which have a true zero point
and equal units. The true zero point (e.g., the point at which there is “no
height at all” or “no weight at all”) indicates precisely where measurement
begins and the equal units (e.g., feet, pounds, and minutes) provide uniform
meaning from one situation to another and from one part of the scale to
another. Ten pounds indicates the same weight to the doctor, the grocer, the
farmer, and the housewife. Also, the difference between 15 and 25 lb.
represents exactly the same difference as that between 160 and 170 lb.

Unfortunately, the properties of physical measuring scales, with which we
are all so familiar, are generally lacking in educational measurement. A pupil
who receives a score of zero on a history test does not have zero knowledge of
history. There are probably a large number of simple questions he could answer
which were not included in the test. A true zero point in achievement, where
there is “no achievement at all,” cannot be clearly established. Even if it could,
it would be impractical to start from that point each time we tested. What we 
do in actual practice is to assume a certain amount of basic knowledge and 
measure from there. This arbitrary starting point, however, prevents us from 
saying that a test score of zero indicates “no achievement at all,” or that a 
score of 100 represents twice the achievement of a score of 50. Since we are 
never certain how far the zero score on our test is from the true zero point 
(i.e., the point of “no achievement at all”), test scores must be interpreted in 
relative rather than absolute terms. We can speak of “more” or “less” of a 
given characteristic but not “twice as much as” or “half as much as.” 
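
The effect of an arbitrary starting point can be checked with a little arithmetic. In the sketch below, `hidden_knowledge` is a hypothetical quantity, not something any test reports: it stands for the achievement a pupil already possesses below the point at which the test begins counting. The function name and the sample values are likewise assumptions chosen only to illustrate the argument.

```python
def true_ratio(score_a, score_b, hidden_knowledge):
    """Ratio of total achievement when both pupils also hold some
    unmeasured knowledge below the test's arbitrary zero point."""
    return (score_a + hidden_knowledge) / (score_b + hidden_knowledge)


# Only when the test zero coincides with the true zero is 100 "twice" 50.
for hidden in (0, 50, 200):
    print(hidden, round(true_ratio(100, 50, hidden), 2))
```

The ratio falls from 2.0 to 1.5 to 1.2 as the unmeasured amount grows, yet the pupil with 100 remains ahead in every case. This is why "more" and "less" survive an unknown zero point while "twice as much" does not.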

The interpretation of test results is additionally handicapped by the inequality
of our units of measurement. Sixty items correct on a simple vocabulary test
does not have the same meaning as sixty items correct on a more difficult one;
nor does either of the scores represent the same level of achievement as sixty
items correct on a test of arithmetic, science, or study skills. Our test items
simply do not represent equal units like those of feet, pounds, and minutes.

To overcome the lack of a definite frame of reference in educational
measurement, various methods of expressing test scores have been devised. As we
shall see shortly, the methods vary considerably in the extent to which they
provide units which have uniform meaning from one measurement to another.
Much of our difficulty in the interpretation and use of test results arises from
the fact that we have so many different scoring systems, each with its own
peculiar characteristics and limitations.


TYPES OF TEST SCORES AND NORMS 


Raw Scores

If a pupil responds correctly to 65 items on an objective test in which each
correct item counts one point, his raw score will be 65. Thus, a raw score is
simply the number of points received on a test when the test has been scored
according to the directions. It does not make any difference whether each item
is counted one point, the items are weighted in some way, or a correction
for guessing is applied; the resulting point score is known as a raw score. We
are all familiar with raw scores from our many years of taking classroom tests.

Although a raw score provides a numerical summary of a pupil’s test
performance, it is uninterpretable without further information. Getting 65 items
right on an arithmetic test, for example, has no particular meaning in itself.
We cannot determine whether it is high or low unless we have some basis for
comparison. When using informal classroom tests, we generally compare a
score to (1) the total number of items in the test, or (2) the scores obtained
by the pupil’s classmates.


Comparing a raw score with the total number of items in the test yields the
percentage of items answered correctly. Our score of 65 in arithmetic, for
example, would be high if there were only 70 items in the test (93 per cent
correct) and low if there were 130 items (50 per cent correct). At first glance,
this appears to be a useful method for interpreting test scores. It is of little
value in all but mastery tests, however, where we are interested in pupils
obtaining perfect scores and the “percentage correct” indicates how far a pupil
is from our goal of complete mastery. In general achievement tests, where we
have no such absolute standard of mastery, “percentage correct” is rather
meaningless. The aim in constructing these tests is to obtain items near the
50 per cent level of difficulty in order to obtain maximum discrimination
between high and low achievers. The percentage of test items pupils answer
correctly, then, is as much an indication of how successful we have been in
controlling the difficulty of the items as it is of the pupils’ level of
performance. Teachers who use “percentage correct” to assign letter grades and
to set an arbitrary passing score on their classroom tests assume a precision in
test construction which is unattainable, even by the most highly trained test
specialists.

Scores on informal classroom tests are most meaningful when compared with
the scores of other pupils in the same classroom group. By arranging the test
scores in rank order, for example, we can determine whether a particular score
is third from the top, about average, or one of the lowest scores in class. By
interpreting a pupil’s test score in terms of his relative position in the group,
we circumvent the problems arising [...]. The fact that a test is relatively [...]
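
The two informal comparisons just described can be sketched in a few lines. The percentages reproduce the arithmetic example in the text; the function names and the list of class scores are assumptions invented for illustration and do not come from any actual class.

```python
def percent_correct(raw_score, n_items):
    """Raw score expressed as a whole-number percentage of the items."""
    return round(100 * raw_score / n_items)


def rank_from_top(score, class_scores):
    """1 = highest score in the class; tied scores share the better rank."""
    return 1 + sum(1 for s in class_scores if s > score)


print(percent_correct(65, 70))    # 93, as in the text
print(percent_correct(65, 130))   # 50, as in the text

# Hypothetical class scores, invented for illustration only.
class_scores = [88, 81, 65, 65, 59, 52, 47]
print(rank_from_top(65, class_scores))  # 3
```

The same raw score of 65 yields two very different percentages, while its rank depends only on how classmates performed on the same test.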


Derived Scores 


While raw scores can be used directly for some classroom purposes, their
use is restricted in two important ways. First, since a raw score is meaningless
by itself, it is difficult to interpret and use beyond the immediate situation. To
make a raw score of 80 interpretable, for example, it must be accompanied by
information concerning the nature of the group tested and the distribution of
scores obtained. Second, raw scores on different tests cannot be compared
directly. If a pupil obtains a raw score of 55 in spelling and 35 in arithmetic, for
example, we have no basis for determining in which subject he has superior
achievement. Spelling 55 words correctly might represent superior or inferior
achievement depending on the difficulty of the words. The same might be said
concerning the solving of 35 arithmetic problems. To compare scores on
different tests, we need a unit of measurement which has fairly uniform
meaning from one test to another and from one part of the scale to another (like
those used in physical measurement).

Derived scores provide units which approximate the uniformity we desire in
test scores. A derived score is a numerical report of test performance in terms
of the pupil’s relative position in a clearly defined reference group. Converting 
raw scores to derived scores is simple with standardized tests. All we need to do 
is consult the table of norms in the test manual and select the derived score 
which corresponds to the pupil’s raw score (some derived scores are so easily 
calculated that we can also develop local norms, if desired). 

The most common types of derived scores are grade equivalents, age
equivalents, percentile ranks, and standard scores. The first two describe test
performance in terms of the group in which a pupil’s raw score is just average.
The last two indicate the pupil’s relative standing in a particular group in
which he is a member, or desires to become a member. The specific meaning
of each type of score is given in Table 14.1.


Norms 

Tables of norms included in test manuals merely present scores earned by
pupils in clearly defined reference groups. The raw scores and derived scores
are presented in parallel columns so that the conversion to derived scores is
easily made. These scores do not represent especially good or desirable
performance, but rather “normal” or “typical” performance. They were obtained
at the time the test was standardized by administering the test to representative
groups of pupils for whom the test was constructed. Thus, they indicate the
typical performance of pupils in these standardization groups and nothing more.
They should not be viewed as standards, or goals, to be achieved by other
pupils.


Test norms enable us to answer questions such as the following:

1. How does a pupil’s test performance compare with that of other pupils
   with whom we wish to compare him?
2. How does a pupil’s performance on one test (or subtest) compare with his
   performance on another test (or subtest)?
3. How does a pupil’s performance on one form of a test compare with his
   performance on another form of the test, administered at an earlier date?

These comparisons of test scores make it possible to predict a pupil’s
probable success in various areas, to diagnose his strengths and weaknesses, to
measure his educational growth, and to use the test results for various other
instructional and guidance purposes. Such functions of test scores would be
severely curtailed without the use of the derived scores provided by test norms.

A summary of the most common types of test norms¹ is presented in Table
14.1. To interpret and use test results effectively, we need a good grasp of the
characteristics, advantages, and limitations of each of these types of norms.
Therefore each is described in considerable detail in the following pages.

¹ R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and
Education (New York: John Wiley & Sons, 1961).


Table 14.1

MOST COMMON TYPES OF TEST NORMS

Type of Test       Name of Derived          Meaning in Terms of
Norm               Score                    Test Performance

Grade norms        Grade equivalents        Grade group in which pupil’s raw
                                            score is average.

Age norms          Age equivalents          Age group in which pupil’s raw
                                            score is average.

Percentile norms   Percentile ranks         Percentage of pupils in the
                   (or percentile scores)   reference group who fall below
                                            pupil’s raw score.

Standard score     Standard scores          Distance of pupil’s raw score above
norms                                       or below the mean of the reference
                                            group in terms of standard
                                            deviation units.
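
The last two rows of Table 14.1 translate directly into arithmetic. The sketch below follows the wording of the table: a percentile rank as the percentage of the reference group falling below a raw score, and a standard score as the distance from the group mean in standard deviation units. The function names and the five reference scores are assumptions invented for illustration; published norms are built from far larger groups and may treat tied scores differently.

```python
from statistics import mean, pstdev


def percentile_rank(raw_score, reference_scores):
    """Percentage of the reference group scoring below raw_score."""
    below = sum(1 for s in reference_scores if s < raw_score)
    return 100 * below / len(reference_scores)


def standard_score(raw_score, reference_scores):
    """Distance from the reference-group mean in standard deviation units."""
    return (raw_score - mean(reference_scores)) / pstdev(reference_scores)


# Hypothetical reference group, invented for illustration only.
reference = [40, 50, 60, 70, 80]
print(percentile_rank(70, reference))           # 60.0
print(round(standard_score(80, reference), 2))  # 1.41
```

Both scores locate a pupil within a stated group, which is why the same raw score can earn different derived scores when the reference group changes.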


GRADE NORMS 


Grade norms are widely used with standardized achievement tests, especially
at the elementary school level. They are based on the average scores earned by
pupils in each of a series of grades and are interpreted in terms of grade
equivalents. For example, if pupils in the standardization group who are
beginning the fifth grade earn an average raw score of 24, this score is assigned
a grade equivalent of 5.0. Tables of grade norms are made up of such pairs of raw
scores and their corresponding grade equivalents.

Grade equivalents are expressed in two numbers; the first indicates the year
and the second the month. Grade equivalents for the fifth grade, for example,
range from 5.0 to 5.9. This division of the calendar year into tenths assumes
little or no change in test performance during the summer vacation months.
To convert to grade equivalents with a table of grade norms, all one needs to
do is locate in the table the pupil’s raw scores and read off the corresponding
grade equivalents. Grade norms are presented in Table 14.3.
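
Reading a grade equivalent from a table of grade norms amounts to a simple lookup. In the sketch below only the pair 24 and 5.0 comes from the example in the text; the remaining rows of `GRADE_NORMS`, and the function names, are assumptions invented to make the lookup runnable and do not reproduce any published table.

```python
# Hypothetical excerpt from a table of grade norms. Only the pair
# 24 -> 5.0 comes from the text; the other rows are invented.
GRADE_NORMS = {20: 4.6, 22: 4.8, 24: 5.0, 26: 5.2, 28: 5.4}


def grade_equivalent(raw_score, norms=GRADE_NORMS):
    """Read off the grade equivalent listed for a raw score."""
    return norms[raw_score]


def split_grade_equivalent(ge):
    """Split a grade equivalent into (grade, month), e.g. 5.4 -> (5, 4)."""
    return divmod(round(ge * 10), 10)


ge = grade_equivalent(28)
print(ge, split_grade_equivalent(ge))   # 5.4 (5, 4)
```

The second function simply separates the year from the month, reflecting the division of the school year into tenths.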

It should be especially noted that grade norms indicate the average
performance of pupils at various grade levels. For any particular grade equivalent,
50 per cent of the pupils in the standardization group are above this norm and
50 per cent are below. Consequently, we should not interpret a particular grade
norm as something all of our pupils should attain. If half of our pupils are above
norm and half are below, we may conclude that our pupils compare favorably
with the pupils in the norm group. Whether this is good or bad depends on a
number of factors, such as the ability of our pupils, the extent to which the
learning outcomes measured by the test reflect our curriculum emphasis, and
the quality of the educational facilities at our disposal. If we are teaching pupils
with above-average ability under conditions comparable to those of schools in
the norm group, merely matching the norm would be cause for concern. On
the other hand, if our pupils or educational facilities are inferior to those in
the norm group, reaching the norm might call forth considerable pride. In any
case, it is well to remember that the norm is merely an average score made
by pupils in the standardization group. As such, it represents the typical
performance of average pupils in average schools and should not be considered
a standard of excellence to be achieved by others.
The popularity of grade norms is largely due to the fact that test performance
is expressed in units that are apparently easy to understand and interpret. To
illustrate, assume that we obtained the following grade equivalents for John,
who is in the middle of the fifth grade.

    Arithmetic   5.5
    Language     6.5
    Reading      9.0

In examining these scores, teachers, parents, and pupils alike would recognize
that John is exactly average in arithmetic, one year advanced in language, and
three and a half years advanced in reading. Grade equivalents provide a
common unit with which we are all familiar. The only difficulty is that this
familiarity leads to interpretations which are misleading or inaccurate by those
who are unaware of the numerous limitations of grade norms. In fact, the
limitations are so severe that over the years test specialists have made a
concerted effort to have them replaced by more suitable scores.
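
The familiar reading of John's scores is nothing more than subtraction from his grade placement, as the sketch below shows. The grade placement of 5.5 is taken from "the middle of the fifth grade"; the function name is an assumption. The sketch reproduces the surface interpretation only, and the limitations discussed in the text apply to every number it prints.

```python
grade_placement = 5.5   # John is in the middle of the fifth grade
john = {"Arithmetic": 5.5, "Language": 6.5, "Reading": 9.0}


def years_from_placement(grade_equivalent, placement):
    """Grade equivalent minus grade placement, in grade units."""
    return round(grade_equivalent - placement, 1)


for subject, ge in john.items():
    print(subject, years_from_placement(ge, grade_placement))
```

The differences come out as 0.0, 1.0, and 3.5 grade units, matching "exactly average," "one year advanced," and "three and a half years advanced."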

The most serious limitation of grade norms is that the units are not equal
on different parts of the scale, or from one test to another. A year of growth in
arithmetic achievement from grade 4.0 to 5.0, for example, might represent a
much greater improvement than an increase from grade 2.0 to 3.0 or grade
8.0 to 9.0. Thus, being advanced or retarded in terms of grade units has a
different meaning on different parts of the grade scale. A pupil who earns a
grade equivalent several grades above his grade placement might be
demonstrating vastly superior achievement, or performance just slightly above
average.

One reason that grade norms provide unequal units is that growth in school
subjects is uneven. At grade levels where educational growth is rapid, grade
units indicate large differences in achievement, and where growth slows down
grade units correspond to very small differences. This difficulty is further
complicated by the fact that patterns of growth vary from subject to subject so
that our grade units stretch and contract at different points of the scale for
different subjects. In general, grade equivalents provide units which are most
comparable when used at the elementary level in those areas receiving relatively
consistent instructional emphasis (that is, arithmetic, language skills, and
reading).
A further limitation of grade norms is that high and low grade equivalents
have dubious meaning. Raw scores corresponding to grade equivalents below
grade 3 and above grade 9 are usually estimated (by extrapolation) rather than
determined by measurement. This results in artificial units which do not
correspond to the achievement of pupils in any particular group. This estimating
of grade equivalents is frequently necessary because the younger pupils do not
have the needed skills to take the test and because growth in the basic skills
tends to level off in the eighth and ninth grades. In interpreting grade
equivalents at the extremes, therefore, it is well to keep in mind that they do not
represent the actual performance of pupils at these levels.
The lack of equal units and the questionable value of extreme scores are
especially troublesome when comparing a pupil’s relative performance in
different subjects. This can be illustrated with the test data presented in Table
14.2. These are actual distributions of test scores based on an achievement
battery administered to a classroom group of thirty beginning sixth-graders.
First, note that the reading scores range from a grade equivalent of 3.0 to
10.4, and that the language and arithmetic scores cover a much more restricted
range of grade equivalents. This is a typical finding with achievement tests
which measure the basic skills. Now look at the circled number near the top
of each distribution of scores. This marks the grade equivalent earned by Walt
Smith on each of the tests. Note that he is at or near the top of each
distribution. His grade equivalent scores and the percentage of the group who
obtained scores lower than his are summarized below. In each case the center
of the class interval was taken to represent his grade equivalent score.


                 Walt Smith’s          Percentage Below
                 Grade Equivalents     Walt Smith’s Grade Equivalent

Reading               10.2                    97
Language               8.2                    95
Arithmetic             7.2                    97
Table 14.2

DISTRIBUTION OF GRADE EQUIVALENTS IN READING, ARITHMETIC,
AND LANGUAGE FOR THIRTY SIXTH-GRADE PUPILS

[Frequency table: grade equivalents in class intervals of five tenths of a
grade, from 3.0-3.4 to 10.0-10.4, with frequency columns for Reading, Language,
and Arithmetic. The cell frequencies are not legible in this copy.]

* Circled numbers indicate the grade equivalent earned by Walt Smith.


When we examine Walt Smith’s grade equivalents only, it is hard to resist the
conclusion that he is far more advanced in reading than in arithmetic.
After all, he is only one year above his grade level in arithmetic, but four years
advanced in reading. These differences are quite impressive. They are also
misleading. The differences are due to the inequality of grade units
from one test to another. When Walt Smith’s performance on the tests is
compared in terms of the percentage of the group falling below him (percentile
rank), it is apparent that his performance is identical on these two tests. In
other words, when compared to a group of sixth-grade pupils he holds the same
relative position in arithmetic and reading. Although test scores based on
thirty pupils do not represent very impressive evidence, these scores are typical
of those obtained from larger samples of pupils. The range of grade equivalent
scores characteristically varies from one type of test to another, resulting in
unequal units and distorted comparisons between tests.
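
Walt Smith's results can be set side by side to show how the two kinds of score disagree. The grade equivalents and percentages are those summarized in the text, with the reading percentage taken to equal the arithmetic figure because the text describes the two performances as identical. The grade placement of 6.0 for a beginning sixth-grader and the function name are assumptions.

```python
grade_placement = 6.0   # beginning of the sixth grade (assumed placement)

# (grade equivalent, percentage of the class scoring below him).
# The reading percentage is set equal to arithmetic, since the text
# calls his relative position on the two tests identical.
walt = {
    "Reading": (10.2, 97),
    "Language": (8.2, 95),
    "Arithmetic": (7.2, 97),
}


def grade_units_above(grade_equivalent, placement=grade_placement):
    """Grade equivalent minus grade placement, in grade units."""
    return round(grade_equivalent - placement, 1)


for subject, (ge, pct_below) in walt.items():
    print(subject, grade_units_above(ge), pct_below)
```

Reading shows 4.2 grade units above placement and arithmetic only 1.2, yet the percentage of classmates below him is the same on both tests. The apparent gap comes from the wider spread of reading grade equivalents, not from any difference in relative standing.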
Another common misinterpretation of grade norms, although not due to
the scoring system itself, is to assume that if a pupil earns a
certain grade equivalent score in a subject he is ready to do work at that level.
For example, we might conclude that a fourth-grade pupil should be doing
sixth-grade work in language skills if he earns a grade equivalent of 6.0 on a
language skills test. This assumption overlooks the fact that he can obtain a
grade equivalent score well above his grade level by doing the less difficult
test items more rapidly and accurately than the average fourth-grade pupil.
The grade equivalent score of 6.0 may represent nothing more than a thorough
mastery of language skills taught in the first four grades. Thus, grade equivalents
should never be interpreted literally. At best, they are only rough guides to
level of test performance. Pupils at different grade levels who earn the same
grade equivalent score are apt to be ready for quite different types of
instruction.

In summary, grade norms are based on the average performance of pupils
at various grade levels. They are widely used at the elementary school level,
largely because of the apparent ease with which they can be interpreted. Grade
equivalent scores are based on units which are typically unequal. This can
lead to misinterpretations and tends to limit the usefulness of the test results.
In general, grade norms are most useful for reporting growth in the basic skills
during the elementary school period. They are least useful for comparing a
pupil’s performance on different tests. For whatever purpose grade norms are
used, however, inequality of grade units must be considered during
interpretation of the results.


Grade Norms Based on Selected Groups 


When grade norms are prepared, the usual practice is to base the grade equivalents
on all pupils tested at each grade level. This includes pupils who have been
accelerated and those who have been retarded, as well as the pupils who are making
regular progress through school. Since retardation is practiced much more
frequently than acceleration, the typical grade norm is based on a group which
contains a disproportionate number of overage pupils of low ability. The result is
that grade norms reflect a lower level of performance than that achieved by the
average pupil progressing through school at a normal rate.

2 R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and
Education (New York: John Wiley & Sons, 1961).


Some test publishers provide norms with grade equivalent scores based on only
those pupils who are in the proper age range for their grade. The term modal-age
norms is given to one such set of norms. These provide grade equivalents for pupils
in the twelve-months age range which occurs most frequently at each grade level.
This eliminates from the norm group those pupils who entered school at an uncommon
age or whose progress through school has been irregular. By omitting the pupils who
have failed one or more grades, modal-age norms represent a higher level of
achievement than do total-grade norms. The major advantage of modal-age norms is
that you can compare a pupil’s test performance with that of pupils who are making
normal educational progress.

A compromise between modal-age norms and total-grade norms appears in the latest
revision of the Metropolitan Achievement Tests. In these tests the grade norms are
based upon age-controlled samples of pupils at each grade level. Rather than the
twelve-month age range used with modal-age norms, however, an eighteen-month age
range was used. This eliminates the extreme deviants at each grade level without
restricting the norm group as severely as is done with modal-age norms. Such
age-controlled norms should provide a more satisfactory representation of pupil
performance than using all pupils at each grade level.

Grade norms based on modal-age groups and age-controlled samples of pupils are
interpreted in terms of grade equivalents, in the same manner as other grade norms.
The important thing to remember is that they represent a more select group of
pupils at each grade level and consequently represent a higher level of
achievement. It is also well to keep in mind that they are subject to the same
limitations and misinterpretations as total-grade norms.


AGE NORMS 


Age norms are based on the average scores earned by pupils at various ages and are
interpreted in terms of age equivalents. Thus, if pupils who are ten years and two
months of age earn an average raw score of 24, this score is assigned an age
equivalent of 10-2. Tables of age norms in test manuals present parallel columns of
such raw scores and their corresponding age equivalents.

Age norms have essentially the same characteristics and limitations as grade norms.
The major differences are that (1) test performance is expressed in terms of age
level rather than grade level, and (2) age equivalents divide the calendar year
into twelve parts rather than ten. Age equivalents for ten-year-olds, for example,
range from 10-0 to 10-11. Both grade norms and age norms are shown in Table 14.3.
The use of the table is simple. Note, for instance, that a raw score of 24
corresponds to a grade equivalent of 5.0 and an age equivalent of 10-2 for this
particular test.

3 W. N. Durost, Manual for Interpreting Metropolitan Achievement Tests (New York:
Harcourt, Brace & World, 1962).


Table 14.3
GRADE AND AGE NORMS FOR GATES’ LEVEL OF COMPREHENSION TEST*

Raw     Reading   Reading        Raw     Reading   Reading
Score   Grade     Age            Score   Grade     Age
  0      2.0       7-2            30      6.1      11-3
  1      2.1       7-3            31      6.2      11-5
  2      2.2       7-4            32      6.3      11-6
  3      2.4       ...            33      6.4      11-8
  4      2.5       7-8            34      6.5      11-9
  5      ...       ...            35      6.6      11-10
  6      2.7       7-11           36      6.8      12-1
  7      2.8       8-0            37      6.9      12-2
  8      ...       ...            38      7.0      12-3
  9      ...       ...            39      7.2      12-6
 10      3.2       8-5            40      7.3      12-7
 11      3.3       8-6            41      7.5      12-10
 12      3.4       8-7            42      7.6      12-11
 13      3.5       8-8            43      7.7      13-0
 14      3.6       8-9            44      7.9      13-3
 15      3.8       9-0            45      8.2      13-6
 16      ...       ...            46      8.6      13-10
 17      4.0       9-2            47      9.2      14-5
 18      ...       ...            48      9.7      15-0
 19      ...       ...            49     10.1      15-5
 20      ...       9-7            50     10.4      15-8
 21      4.5       9-8            51     10.6      16-0
 22      4.6       9-9            52      ...      16-6
 23      4.8      10-0            53     11.3      16-8
 24      5.0      10-2            54     11.5      16-10
 25      ...       ...            55     11.8      17-1
 26      5.3      10-5            56     12.0      17-4
 27      5.5      10-7            57     12.5      17-10
 28      5.7      10-9
 29      5.9      10-11

* From Manual for the Gates Basic Reading Tests, copyright © 1958 by Arthur I. Gates,
Bureau of Publications, Teachers College, Columbia University. Used by permission.
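
The use of such a table can be sketched in a few lines of Python. This is only an illustration: the dictionary holds a handful of rows copied from Table 14.3, and the names `GATES_NORMS`, `equivalents`, and `age_in_months` are hypothetical helpers introduced here, not part of the test manual.

```python
# A few rows copied from Table 14.3: raw score -> (reading grade, reading age).
# Reading ages use the "years-months" notation of the text.
GATES_NORMS = {
    0: ("2.0", "7-2"),
    14: ("3.6", "8-9"),
    24: ("5.0", "10-2"),
    29: ("5.9", "10-11"),
    38: ("7.0", "12-3"),
    57: ("12.5", "17-10"),
}


def equivalents(raw_score):
    """Return (grade equivalent, age equivalent) for a tabled raw score."""
    return GATES_NORMS[raw_score]


def age_in_months(age_equivalent):
    """Convert an age equivalent such as '10-2' to a total number of months."""
    years, months = age_equivalent.split("-")
    return 12 * int(years) + int(months)


grade, age = equivalents(24)
print(grade, age, age_in_months(age))  # 5.0 10-2 122
```

Keeping the age equivalent as a "years-months" string, and converting to months only when arithmetic is needed, avoids mistaking 10-2 for the decimal 10.2.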




Many achievement tests and most intelligence tests are provided with age
norms. In the achievement area, the age equivalents are frequently called by
subject name. Thus, we speak of a pupil’s reading age, arithmetic age, spelling
age, and the like. Where the age equivalents refer to general achievement, such
as that obtained from an achievement test battery, the term educational age or
achievement age is commonly used. In intelligence testing, age equivalents are
called mental ages. Despite the variation in terminology, all age norms are
interpreted in the same way. A pupil with a reading age of 12-6 and a mental
age of 13-0, for example, has earned a score on the reading test equal to that
of the average twelve-and-a-half-year-old child and a score on the intelligence
test equal to that of the average thirteen-year-old child.

As with grade norms, age norms present test performance in units that are 
easily understood but which are characteristically unequal. Variations in patterns 
of growth at different age levels and from one type of ability to another cause
age units to lack uniform meaning. A year of growth may represent a consider- 
able improvement in ability or only a slight increase. 

The use of age norms is most appropriate at the elementary school level where
mental and educational growth tend to be continuous and somewhat regular.
Mental age remains a useful concept at the high school level, but its meaning is
obscure for above average pupils. Since mental growth slows down at approximately
age sixteen, mental ages beyond this must be estimated by extrapolation.
Thus, a mental age of 18-0 does not represent the average performance of
eighteen-year-olds. It does not represent the actual test performance of any age
group. It is an artificial mental age used to express the test performance of
persons whose mental ability is superior to that of the average sixteen-year-old.
Needless to say, such artificial mental ages must be interpreted with extreme
caution.

Literal interpretations should be as assiduously avoided with age norms as
with grade norms. An eight-year-old child with a mental age of 10-0 has superior
mental ability but he is not necessarily ready to cope with the same mental tasks
as the average ten-year-old child. His performance might be largely due to
rapid and accurate work on the more simple items and exceptional success on
the number problems, or it might reflect some other combination of work skills
and special ability. The mental age of 10-0 merely tells us that his raw score is
equal to the raw score of the average ten-year-old child. It does not tell us what
combination of factors went into earning that score. As Cronbach⁴ has noted, a
mental age of 12-0 does not mean the same thing for a nine-year-old and a
fourteen-year-old child since they make different errors on the test.

In summary, age norms are based on the average performance of pupils at 
various age levels. Age units are characteristically unequal, and especially sub- 
ject to misinterpretation at the high school level where superior performance 
must be expressed in artificial units. Age norms are most useful for expressing 
growth in mental ability, reading ability, and similar characteristics which
have fairly consistent growth patterns. If not interpreted literally, they also
provide a useful basis for grouping pupils with somewhat similar ability and
for making general comparisons of scholastic aptitude and achievement.

4 L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row, 1960).


The Use of Quotients 


Until recently, quotients were widely used with age equivalents to indicate 
rate of development. The best known of these is the ratio intelligence quotient 
(IQ) which is determined by dividing mental age by chronological age and 
multiplying by 100. Another is the educational quotient (EQ) which uses a 
similar formula but substitutes a subject age or general achievement age for 
mental age. Still another is the accomplishment quotient (AQ), determined by 
dividing an educational age by mental age and multiplying by 100. Although 
some of these are still in use, they are rapidly being discarded and are generally 
not recommended. The major shortcomings of such quotients are that they 
assume an equality of age units and a continuity of growth which simply do 
not exist.
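
For readers who meet these quotients in older records, the three formulas just described can be written out as a short Python sketch. The function names and the example ages (expressed in months for a hypothetical pupil) are illustrative assumptions; the sketch shows only the arithmetic and is not a recommendation to use the quotients.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return 100 * mental_age / chronological_age


def educational_quotient(educational_age, chronological_age):
    """EQ: a subject or achievement age divided by chronological age, times 100."""
    return 100 * educational_age / chronological_age


def accomplishment_quotient(educational_age, mental_age):
    """AQ: educational age divided by mental age, times 100."""
    return 100 * educational_age / mental_age


# Hypothetical pupil, ages in months: chronological 8-0, mental 10-0,
# educational 9-0.
ca, ma, ea = 96, 120, 108
print(ratio_iq(ma, ca))                 # 125.0
print(educational_quotient(ea, ca))     # 112.5
print(accomplishment_quotient(ea, ma))  # 90.0
```

Because each quotient divides one age equivalent by another, it inherits the unequal age units discussed above, which is the reason given in the text for discarding them.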

The old ratio IQ is being replaced by a deviation IQ. This is not a quotient 
based on the relationship between mental age and chronological age, but a type 
of standard score. It is obtained directly from tables of norms in test manuals 
and interpreted like any other standard score. Its properties will be considered 
in more detail later in this chapter, when standard scores are considered. For 
now, the significant thing to note is that the ratio IQ and the educational 
quotients patterned after it are obsolete and soon will disappear from the educa- 
tional scene. Until then, they should be interpreted with extreme caution when- 
ever encountered. 


PERCENTILE NORMS 


One of the most widely used and easily comprehended methods of describing 
test performance is that of percentile rank. A percentile rank (or percentile 
score) indicates a pupil’s relative position in a group in terms of the percentage
of pupils scoring below him. Thus, if we consult a table of norms and find that 
a pupil’s raw score of 39 equals a percentile rank of 75, we know that 75 per 
cent of the pupils in the reference group obtained a raw score lower than 39. 
Stating it another way, this pupil’s performance surpasses that of 75 per cent 
of the group. 
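
The definition can be illustrated with a minimal Python sketch. The reference group below is a made-up list of twenty raw scores, and `percentile_rank` is a hypothetical helper that simply counts the scores falling below the pupil’s score; published norm tables rest on far larger groups and may treat tied scores differently.

```python
def percentile_rank(score, group_scores):
    """Percentage of scores in the reference group that fall below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100 * below / len(group_scores)


# Hypothetical reference group of 20 raw scores (not from any real test).
group = [12, 15, 18, 20, 21, 23, 24, 26, 27, 29,
         30, 31, 33, 34, 36, 39, 41, 43, 46, 50]

print(percentile_rank(39, group))  # 75.0
```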

One method of presenting percentile norms is shown in Table 14.4. These
norms are for the Academic Promise Tests. The raw scores for each subtest,
and for combined subtests, are listed in columns across the table. The procedure
for obtaining the percentile rank for any given raw score is to locate the score
in the proper column and then to read the corresponding percentile rank at
the side of the table. For example, a raw score of 37 on the Abstract Reasoning
test is equivalent to a percentile rank of 80, and a raw score of 19 on the
Numerical test is equivalent to a percentile rank of 50. Only selected percentile
ranks are given in the norm table and it is suggested that these be considered as
midpoints of a band of values. Thus, a percentile rank of 80 is interpreted as a
percentile band ranging from 78 to 82. This is to allow for error in the
measurement.


Table 14.4
PERCENTILE NORMS FOR ACADEMIC PROMISE TESTS*

ACADEMIC PROMISE TESTS - NORMS
FORMS A OR B - BOYS AND GIRLS
GRADE 6 (N = 8140)

                                   Raw Scores
             Abstract                        Language                      APT
Percentile   Reasoning  Numerical  Verbal    Usage     AR+N     V+LU      Total     Percentile
   99         49+        40+        45+       48+       86+      89+      168+         99
   97         47-48      37-39      42-44     45-47     80-85    82-88    156-167      97
   95         45-46      33-36      39-41     40-44     74-79    76-81    144-155      95
   90         42-44      30-32      36-38     36-39     68-73    70-75    134-143      90
   85         39-41      28-29      34-35     33-35     63-67    66-69    125-133      85
   80         36-38      26-27      32-33     31-32     59-62    62-65    117-124      80
   75         33-35      24-25      31        29-30     55-58    59-61    111-116      75
   70         30-32      23         30        27-28     51-54    56-58    105-110      70
   65         27-29      22         28-29     26        47-50    54-55    100-104      65
   60         24-26      21         26-27     25        44-46    51-53    95-99        60
   55         22-23      20         25        24        42-43    48-50    91-94        55
   50         21         19         24        22-23     40-41    46-47    87-90        50
   45         20         18         23        20-21     38-39    44-45    83-86        45
   40         19         17         22        19        36-37    42-43    80-82        40
   35         18         16         21        18        35       40-41    76-79        35
   30         17         15         20        17        34       38-39    73-75        30
   25         16         14         19        16        32-33    36-37    69-72        25
   20         ...        ...        ...       15        30-31    33-35    65-68        20
   15         ...        ...        15-16     14        28-29    30-32    61-64        15
   10         ...        ...        13-14     12-13     26-27    27-29    56-60        10
    5         ...        ...        11-12     10-11     23-25    23-26    50-55         5
    3         ...        ...        8-10      8-9       21-22    20-22    44-49         3
    1         ...        ...        0-7       0-7       0-20     0-19     0-43          1
 Mean         ...        20.1       25.1      23.5      45.1     48.6     93.7       Mean
   SD         ...        7.8        8.7       9.7       16.7     17.0     30.9         SD

* Reproduced by permission. Copyright ... The Psychological Corporation, New York,
New York. All rights reserved.



A distinctive feature of percentile norms is that we can interpret a pupil’s
performance in terms of any group in which he is a member, or desires to
become a member. Most commonly, performance is reported in terms of the
pupil’s relative standing in his own grade or age group. In some instances,
however, we are more interested in how a pupil compares to those who have
completed second-year French, are majoring in home economics, or are enrolled
in the college-preparatory program. Such comparisons are possible with
percentile norms. The interpretations we can give to a particular score are only
limited by the types of decisions we desire to make and the availability of the
appropriate sets of norms.

The wide applicability of percentile norms is not without its drawbacks.
When interpreting a percentile rank, we must always refer to the norm group
on which it is based. A pupil does not have a percentile rank of 86. He has a
percentile rank of 86 in some particular group. A raw score on a scholastic
aptitude test, for example, may be equivalent to a percentile rank of 86 in a
general group of high school seniors, 63 in a group of college-bound seniors,
and 25 in a group of college freshmen in a highly selective college. Relative
standing varies with the ability of the reference group used for comparison.

A related inconvenience with percentile norms is that numerous sets of norms
are usually required. We need a set of norms for each group to which we wish
to compare a pupil. This is not especially troublesome at the elementary school
level where a pupil’s own grade-mates and age-mates provide a suitable basis for
comparison. At the high school level, however, where the curriculum becomes
diversified and pupils pursue different courses, it becomes a rather acute
problem. Here we need sets of norms for pupils who have completed varying
numbers of courses in each subject area. For guidance purposes, we should also
like norms based on the various educational and vocational groups in which
our pupils plan to seek admission. The difficulty of producing such a diverse
collection of norms for any particular test is quite obvious. Test publishers
usually make available norms within grades and within courses, but norms for
special purposes frequently must be developed locally.

The major limitation of percentile norms is that percentile units are not equal
on all parts of the scale. A percentile difference of ten near the middle of the
scale (e.g., 45 to 55) represents a much smaller difference in test performance
than the same percentile difference at the ends (e.g., 85 to 95). This is
because a large number of pupils tend to obtain scores near the middle, while
relatively few pupils have extremely high or low scores. Thus, a pupil whose
raw score is near average can surpass another 10 per cent of the group by
increasing his raw score just a few points. On the other hand, a pupil with a
relatively high score will need to increase his raw score by a large number of
points to surpass another 10 per cent, simply because there are so few pupils at
that level.

The inequality of units requires special caution when using percentile ranks.
First, a difference of several percentile points should be given greater weight at
the extremes of the distribution than near the middle. In fact, small differences
near the middle of the distribution generally can be disregarded. Second,
percentile ranks should not be averaged arithmetically. The appropriate average
when using percentile norms is the 50th percentile. This is the midpoint of the
distribution and is called the median, or counting average.

In summary, percentile norms are widely applicable, easily determined, and
readily understood. A percentile rank describes a pupil’s performance in terms
of the percentage of pupils he surpasses in some clearly defined group. This
might be his own grade or age group, or any group in which he desires to
become a member (e.g., college freshmen). More than one set of norms is
usually required and percentile ranks must be interpreted in terms of the specific
norm group on which they are based. The most severe limitation of percentile
ranks is that the units are unequal. This can be offset somewhat by careful
interpretation, however, since the inequality of units follows a predictable
pattern.
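
The predictable pattern of unequal units can be made concrete with a small Python sketch. It assumes, purely for illustration, a hypothetical test whose scores follow a normal curve with a mean of 50 and a standard deviation of 10; the name `raw_score_at` is introduced here and does not come from any test manual.

```python
from statistics import NormalDist

# Hypothetical test: normally distributed scores, mean 50, SD 10 (assumed).
scores = NormalDist(mu=50, sigma=10)


def raw_score_at(percentile_rank):
    """Raw score below which the given percentage of the group falls."""
    return scores.inv_cdf(percentile_rank / 100)


# The same ten-point percentile difference, near the middle and near the top.
middle_gap = raw_score_at(55) - raw_score_at(45)
upper_gap = raw_score_at(95) - raw_score_at(85)

print(round(middle_gap, 1), round(upper_gap, 1))  # 2.5 6.1
```

Under these assumptions, moving from the 45th to the 55th percentile takes about two and a half raw score points, while moving from the 85th to the 95th takes about six.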


STANDARD SCORE NORMS 


Another method of indicating a pupil’s relative position in a group is by
showing how far his raw score is above or below average. This is the approach
used with standard scores. Basically, standard scores express test performance
in terms of standard deviation units from the mean. The mean (M) is the
arithmetical average, which is determined by adding all scores and dividing by the
number of scores. The standard deviation (SD) is a measure of the spread of
scores in a group. Since the method of computing the standard deviation is not
especially helpful in understanding it, the procedure will not be presented here.⁵
The meaning of standard deviation, and the standard scores based on this unit,
can best be explained in terms of the normal probability curve (also called the
normal distribution curve or simply the normal curve).
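
Although the computing procedure is left to the appendix, both statistics can be obtained in a line or two with Python’s standard library. The scores below are hypothetical, and the sketch assumes the population form of the standard deviation (dividing by the number of scores); the appendix formula may differ in detail.

```python
from statistics import mean, pstdev

# Hypothetical raw scores for a small group of six pupils.
scores = [52, 54, 56, 56, 58, 60]

m = mean(scores)     # sum of the scores divided by the number of scores
sd = pstdev(scores)  # population standard deviation (divides by N)

print(f"{m:.1f} {sd:.2f}")  # 56.0 2.58
```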


The Normal Curve and the Standard Deviation Unit 


The normal curve is a symmetrical bell-shaped curve which has many useful
mathematical properties. One of the most useful from the viewpoint of test
interpretation is that when it is divided into standard deviation units each
portion under the curve contains a fixed percentage of cases. This is shown in the
idealized normal curve presented in Figure 14.1 (for the moment, disregard
the raw scores beneath the figure). Note that 34 per cent of the cases fall
between the mean and +1 SD, 14 per cent between +1 SD and +2 SD, and
2 per cent between +2 SD and +3 SD. The same proportions, of course, apply
to the standard deviation intervals below the mean. Only 0.13 per cent of the
cases fall below -3 SD or above +3 SD. Thus, for all practical purposes a
normal distribution of scores falls between -3 and +3 standard deviations
from the mean.
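
These fixed percentages can be checked with a short Python sketch using the standard library’s `NormalDist`. The helper name `per_cent_between` is an illustrative assumption; the rounded figures in the text (34, 14, and 2 per cent) correspond to the more exact values printed here.

```python
from statistics import NormalDist

curve = NormalDist()  # standard normal curve: mean 0, SD 1


def per_cent_between(low_sd, high_sd):
    """Percentage of cases between two points given in SD units."""
    return 100 * (curve.cdf(high_sd) - curve.cdf(low_sd))


print(round(per_cent_between(0, 1), 2))      # 34.13
print(round(per_cent_between(1, 2), 2))      # 13.59
print(round(per_cent_between(2, 3), 2))      # 2.14
print(round(100 * (1 - curve.cdf(3)), 2))    # 0.13
```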


To illustrate the value of standard deviation units for expressing relative
position in a group, raw scores from two different tests have been placed
beneath the row of deviations along the baseline of the curve in Figure 14.1.
The tests have the following means and standard deviations:

          Test A    Test B
    M       56        72
    SD       4         6

Note in Figure 14.1 that the mean raw scores of both tests have been placed
at the zero point on the baseline of the curve. Thus, they have been arbitrarily
equated to zero. Next, note that +1 SD is equivalent to 60 (56 + 4) on Test A
and 78 (72 + 6) on Test B; and that -1 SD is equivalent to 52 (56 - 4) on
Test A and 66 (72 - 6) on Test B. If we convert all of the raw scores on the
two tests to the standard deviation units in this manner, it is possible to
compare directly performance on the tests. For example, a raw score of 62 on
Test A and 81 on Test B are equal because both are 1.5 standard deviation
units above the mean. When this conversion to standard deviation units has been
made, the raw scores are, of course, no longer needed. A +2.5 SD on Test A is
superior to a +2.0 SD on Test B, regardless of the size of the raw scores from
which they were derived. The only restriction for such comparisons is that
the conversion to standard deviation units must be based on a common group.

5 See appendix for computation of standard deviation.
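
The comparison of the two tests can be reproduced with a brief Python sketch. The means and standard deviations are those given for Test A and Test B; the dictionary `TESTS` and the function `sd_units` are hypothetical names used only for this illustration.

```python
# (mean, standard deviation) for the two tests described in the text.
TESTS = {"A": (56, 4), "B": (72, 6)}


def sd_units(raw_score, test):
    """Distance of a raw score from its test mean, in standard deviation units."""
    m, sd = TESTS[test]
    return (raw_score - m) / sd


print(sd_units(62, "A"), sd_units(81, "B"))  # 1.5 1.5
print(sd_units(60, "A"), sd_units(66, "B"))  # 1.0 -1.0
```

Once both scores are expressed this way, the raw scores of 62 and 81 are seen to represent the same relative position, provided both conversions rest on a common group.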


Figure 14.1. Normal curve, indicating the percentage of cases falling within each standard
deviation interval. (Baseline scales: Standard Deviations, Test A, Test B.)


As can be gleaned from this discussion, the utility of the standard deviation
is that it permits us to convert raw scores to a common scale which has equal
units and which can be readily interpreted in terms of the normal curve. At
this point, it should be helpful to review a few of the characteristics of the
normal curve which make it so useful in test interpretation.

Referring to Figure 14.1 again, note that 68 per cent (approximately two
thirds) of the cases fall between -1 and +1 standard deviations from the mean.
This provides a handy bench mark for interpreting standard scores and the
standard error of measurement, since both are based on standard deviation
units. Note also that the fixed percentages in each interval make it possible to
convert standard deviation units to percentile ranks. For example, -2 SD
equals a percentile rank of 2, since 2 per cent of the cases fall below that point.
Starting from the left of the figure, it can be seen that each point on the
baseline of the curve can be equated to the following percentile ranks.

    -2 SD = 2%
    -1 SD = 16% (2 + 14)
    0 (M) = 50% (16 + 34)
    +1 SD = 84% (50 + 34)
    +2 SD = 98% (84 + 14)

This relationship between standard deviation units and percentile ranks
enables us to interpret standard scores in simple and familiar terms. When used
for this purpose, we must, of course, be able to assume a normal distribution.
This is not a serious restriction in using standard score norms, however, since
the distribution of scores on which they are based usually closely approximates
the normal curve. In many instances the standard scores are normalized. That
is, the distribution of scores is made normal by the process of computing
percentiles and converting directly to their standard score equivalents. While it
is generally safe to assume a normal distribution when using standard scores
from tables of norms in test manuals, it is usually unwise to make such an
assumption for standard scores computed directly from a relatively small
number of cases, such as a classroom group.
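
The conversion between standard deviation units and percentile ranks can be sketched in Python, assuming a normal distribution as the text requires. The function names are hypothetical; `normalized_sd_units` shows the reverse step used when scores are normalized, going from a percentile rank back to a position on the normal curve.

```python
from statistics import NormalDist

curve = NormalDist()  # standard normal curve: mean 0, SD 1


def percentile_rank_from_sd(sd_units):
    """Percentage of cases falling below a point given in SD units."""
    return 100 * curve.cdf(sd_units)


def normalized_sd_units(percentile_rank):
    """SD-unit position on the normal curve matching a percentile rank."""
    return curve.inv_cdf(percentile_rank / 100)


for z in (-2, -1, 0, 1, 2):
    print(z, round(percentile_rank_from_sd(z)))
# -2 2
# -1 16
# 0 50
# 1 84
# 2 98
```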


Types of Standard Scores 


There are numerous types of standard scores used in testing. Since they are
all based on the same principle and interpreted in somewhat the same manner,
only the most common types will be discussed here.

z-Scores. The simplest of the standard scores, and the one on which others
are based, is the z-score. This score expresses test performance simply and
directly in terms of the number of standard deviation units a raw score is
above or below the mean. In the previous section, we discussed z-scores but did
not identify them as such. The formula for computing z-scores is

    z-score = (X - M) / SD

where

    X = any raw score
    M = arithmetic mean of raw scores
    SD = standard deviation of raw scores

You can quickly become familiar with this formula by applying it to various
raw scores in Test A or Test B of Figure 14.1, and then visually checking your
answer along the base line of the curve. For example, z-scores for the raw
scores of 58 and 50 on Test A (M = 56, SD = 4) would be computed as
follows:

    z = (58 - 56) / 4 = .5        z = (50 - 56) / 4 = -1.5

It should be noted that a z-score is always minus when the raw score is
smaller than the mean. Forgetting the minus sign can cause serious errors in
test interpretation. It is for this reason that z-scores are seldom used directly
in test norms. They are usually transformed into a standard score system which
uses only positive numbers.

T-Scores. The term “T-score” was originally applied to a specific type of
normalized score based on a group of unselected twelve-year-old children. In
recent years, however, it has come to refer to any set of normally distributed
standard scores which has a mean of 50 and a standard deviation of 10. T-scores
are obtained by multiplying the z-score by 10 and adding the product to 50.
Thus,

    T-score = 50 + 10 (z)

This formula has the effect of moving the decimal point of the z-score one
point to the right and removing all negative numbers. Applying this formula
to the two z-scores computed earlier (z = .5, z = -1.5), we would obtain the
following results:

    T = 50 + 10 (.5) = 55        T = 50 + 10 (-1.5) = 35

Since T-scores always have a mean of 50 and a standard deviation of 10,
any single T-score is directly interpretable. A T-score of 55, for example, always
means one-half standard deviation above the mean, and so on. Once the
concept of T-scores is grasped, interpretation is relatively simple. Test specialists
generally recommend that test publishers use the T-score system.⁶

Other standard scores are computed from z-scores in the same way
that T-scores are determined, but they use different values for the mean and
standard deviation. For example, a mean of 500 and a standard deviation of
100 is used with the Graduate Record Examination and the tests of the College
Entrance Examination Board. Consequently, on these tests a score of 550
means one-half standard deviation above the mean in the same manner as a
T-score of 55. Standard scores can be assigned any arbitrarily selected mean
and standard deviation and the interpretation is the same, since the basic frame
of reference is the standard deviation unit.
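
Since every such scale is simply a chosen mean plus a chosen standard deviation times the z-score, one small function covers them all. The Python sketch below is illustrative only; `standard_score` and `t_score` are hypothetical names, and the values used are those mentioned in the text.

```python
def standard_score(z, mean, sd):
    """Place a z-score on a scale with the chosen mean and standard deviation."""
    return mean + sd * z


def t_score(z):
    """T-score: mean of 50, standard deviation of 10."""
    return standard_score(z, 50, 10)


print(t_score(0.5), t_score(-1.5))    # 55.0 35.0
print(standard_score(0.5, 500, 100))  # 550.0
```

A T-score of 55 and a score of 550 on the 500/100 scale therefore describe the same position, one-half standard deviation above the mean.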
Deviation IQ. Another widely used standard score is the deviation IQ. Here
the mean is set at 100 and the standard deviation at 16 (15 on some tests).
Thus, a pupil with an IQ of 84 has scored one standard deviation below the
mean (T-score = 40), and a pupil with an IQ of 116 has scored one standard
deviation above the mean (T-score = 60). Scores on intelligence tests could
just as easily be expressed in terms of standard scores with a mean of 50 and
a standard deviation of 10. However, the IQ concept is so deeply embedded
that a mean and standard deviation were selected which closely
approximate the distribution of scores obtained with the old ratio IQ.

Expressing IQ’s in terms of standard scores gives them the advantage, over
ratio IQ’s, of equal units and a common meaning at all age levels. They can
also be readily converted to percentile ranks and to other types of standard
scores, as we shall see shortly.

6 American Psychological Association, “Technical Recommendations for Psychological
Tests and Diagnostic Techniques,” Supplement to the Psychological Bulletin, 51, 1954.


Stanines. Some test norms are expressed in terms of single-digit standard
scores called stanines. This system of scores received its name from the fact
that the distribution of raw scores is divided into nine parts (standard nines).
Stanine 5 is located precisely in the center of the distribution and includes all
cases within one-fourth of a standard deviation on either side of the mean. The
remaining stanines are evenly distributed above and below stanine 5. Each
stanine, with the exception of 1 and 9 which cover the tails of the distribution,
includes a band of raw scores the width of one-half of a standard deviation
unit. Thus, for all practical purposes, stanines present test norms on a 9-point
scale of equal units. These standard scores have a mean of 5 and a standard
deviation of 2.
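
The stanine bands just described can be written out as a short Python sketch. The cut points follow directly from the half-standard-deviation widths in the text; the names `CUTS`, `stanine`, and `per_cent_in_each_stanine` are illustrative, and the treatment of a score falling exactly on a boundary (assigned here to the higher stanine) is an assumption.

```python
from bisect import bisect_right
from statistics import NormalDist

# Upper boundaries, in SD units, of stanines 1 through 8.
# Stanine 5 runs from -0.25 to +0.25; the others are half an SD wide.
CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine(z):
    """Stanine (1 to 9) for a score expressed in standard deviation units."""
    return bisect_right(CUTS, z) + 1


def per_cent_in_each_stanine():
    """Rounded percentage of a normal distribution falling in each stanine."""
    curve = NormalDist()
    edges = [0.0] + [curve.cdf(c) for c in CUTS] + [1.0]
    return [round(100 * (hi - lo)) for lo, hi in zip(edges, edges[1:])]


print(stanine(0), stanine(-1.0), stanine(2.0))  # 5 3 9
print(per_cent_in_each_stanine())  # [4, 7, 12, 17, 20, 17, 12, 7, 4]
```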


Comparison of Score Systems 


The equivalence of scores in various standard score systems and their
relation to percentiles and to the normal curve is presented in Figure 14.2. This
figure illustrates the interrelated nature of the various scales for indicating
relative position in a normally distributed group. A raw score one standard
deviation below the mean, for example, can be expressed as a z-score of -1.0,
a percentile rank of 16, a T-score of 40, a deviation IQ of 84, or a stanine of 3.
Thus, the various scoring systems are merely different ways of saying the same
thing and we can readily convert back and forth from one scale to another
(providing, of course, that we can assume a normal distribution).

The relations among the scoring systems shown in Figure 14.2 are especially
helpful in learning to understand a particular standard score scale. Until we
fully understand T-scores, for example, it is helpful to convert them, mentally,
into percentile ranks. A T-score of 60 becomes meaningful when we note that
it is equivalent to a percentile rank of 84. This conversion to percentile ranks,
which are more easily understood, is also useful for interpreting standard scores
to parents and pupils. Still another value of knowing the relations among the
scales is in comparing a pupil’s performance on tests using different scoring
systems. An IQ of 92 can be compared to a percentile rank of 16 (or a T-score
of 40) on an achievement test, for example, in order to determine the degree
of underachievement. It should be noted that this type of comparison assumes
that the norms for the different tests are comparable.

In summary, standard scores indicate a pupil’s relative position in a group
in terms of standard deviation units above or below the mean. In a normal
distribution, the various standard score scales and the percentile scale are
interrelated, making it possible to convert from one to another. Standard scores
have the special advantage of providing equal units. Thus, unlike percentiles,
ten standard score points represent the same difference in test performance
anywhere along the scale. In addition, standard scores can be averaged
arithmetically. One drawback of standard scores is that they are not readily
understood by pupils and parents. A more serious limitation is that interpretation
is difficult unless the scores are normally distributed. This is not a problem in
using standard score norms, however, since norm tables are generally based on
normalized standard scores.

Figure 14.2. Corresponding standard scores and percentiles in a normal distribution.
(Scales shown: standard deviations, percentiles, T-scores, deviation IQ’s, stanines,
and per cent in stanine.)
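
The conversions among scales can be sketched in Python, assuming a normal distribution and a deviation IQ with a standard deviation of 16 (15 on some tests). The function names are hypothetical and introduced only for this illustration.

```python
from statistics import NormalDist

curve = NormalDist()  # standard normal curve: mean 0, SD 1


def describe(z, iq_sd=16):
    """Express one position on the normal curve in several score systems."""
    return {
        "z": z,
        "percentile": round(100 * curve.cdf(z)),
        "T": 50 + 10 * z,
        "IQ": 100 + iq_sd * z,
    }


def z_from_iq(iq, iq_sd=16):
    """Deviation IQ back to standard deviation units."""
    return (iq - 100) / iq_sd


def z_from_percentile(percentile_rank):
    """Percentile rank back to standard deviation units (normal curve assumed)."""
    return curve.inv_cdf(percentile_rank / 100)


one_below = describe(-1.0)
print(one_below["percentile"], one_below["T"], one_below["IQ"])  # 16 40.0 84.0

# The comparison in the text: an IQ of 92 against a percentile rank of 16.
print(z_from_iq(92), round(z_from_percentile(16), 1))  # -0.5 -1.0
```

Placing both results on the common z-scale shows the achievement score about half a standard deviation lower than the aptitude score, which is meaningful only if the two sets of norms are comparable.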


PROFILES 


One of the major advantages of converting raw scores to derived scores is
that a pupil’s performance on different tests can be compared directly. This is
usually done by means of a test profile, like the one presented in Figure 14.3.
Such a graphic representation of test data makes it easy to identify a pupil’s
relative strengths and weaknesses. Most standardized tests have provisions for
plotting test profiles.

The student profile shown in Figure 14.3 illustrates a recent trend in profile
construction. Instead of plotting test scores as specific points on the scale, test
performance is recorded in the form of bands which extend one standard error
of measurement above and below the pupil’s obtained scores. It will be recalled
from our discussion of reliability, that there are approximately two chances out
of three that a pupil’s true score falls within one standard error of his obtained
score. Thus, these bands indicate the ranges of scores within which we can be
reasonably certain of finding the pupil’s true standings. Plotting them on the
profile enables us to take into account the inaccuracy of the test scores when


(Cworsswaied 


Aq pas, ‘aataras ñuusə, L uoneonpy ‘uoistarg yay, saneradooy Aq Lel WSukdoD) `spueq ə1ioos jo əsu Surmoys ‘yog WEPMS CALS "SPL MIA 


"S3¥ODS ONI13W4831NI YOS WANVW 
° ° d3LS 4909 u! pəuto4uo2 amm suoro191diətut jo UOLSSNOSIP pə|tolop BOW 


a] 
= 


` ə9uə1əç ut Burpuoşs siy uoy 94514 
a ie ài m $1 sə!pniS Ip!oS u! sBuLpuO4s 5, JUPAS Əy} !dojuəA0 JOU op saipny 
| -95 pub 809195 104 spung “19AƏMOH “Ə2u9!2S PUD SOHDWIYJOW Jo 


Suns OY] “soa OMY OsOy) u! sBu!pupis s 1uopnis ət LOOMING əəuə:əjjip 
Ə2upliodiu! ou s! ay} !dojaəA9 səlpniS 101905 pun SIHOWƏYJOW 104 spuog 


o KS oz oz 
L Ld by zç - 1b i (gz) əəuə1os 
14-09 (VZ) so!prus Io1os 
& [] H m y ot 29-05 (Ve) soupuoulow ait sia) 
YA aau; 104 spuoq Ə|uuəo)od s,4uəpnis o “suuou |p3o| oj Bu!pJoa5y sojduoxg 
i Y Y *Puoq samoj 
PR GY Puoq samoj 
or L A o & or 24 Aq poyuasoidos Bulpuols uoyy 9l9q joos s! puoq s0yB1y ays Aq pajus 
A Y -aidas Buypuojs “dolu9Ao You op 54504 om Avo 40) spuoq Ə|Huo3JƏd ayy JJ "Z 
r 1 lÍ 7 "54504 OM) asoy uo sButpupis s, uaprys ayy uəəwləq əəuoəjjip 
É YY soins CT S 
dë A o & os :A|ddo sə|ru Buto||oJ 
kel p Əy “#91195 d31S Nl u! 5450) omy Aup uo sñulpupis suuopnis o asoduoo o 
7 u 8 
o p4 pA o f o9 199891 yam oB010A0 mojog s! souowsoyiod Buluəls? suuəpnis sya “SPiOm Jajo 
Hi U| “32y81y iu 49d pg puo 4uopnis sN uoy 19AO| 01095 dnorB suuou ayy u 


F 


SHUOPNYS JO quə2 1Əd HZ 4oy} mouy NOK “gg-pZ s! Puoqə|!uuəo1əd Buluəls!7 ay) 
31 “Ə|duoxə 404 *puoq ə|Huə2:əd oy) Ao|9q puo əA0qp utunjo2 ayy Jo sod 


oz o¿ 
lig Papoysun əy} ajou “pəsn dnou6 suuou Əd u! siuəprus JO JOY tH! sotu9s 431 
NI u! s154 941 Jo auo uo əəuouuoprəd s iuəpnis p a1odwWoD o] ° on 
08 ° *PƏsn 54504 OY) 104 s|onuou) a44 4|nsuo5 “SIYOS 
06 06 [ 06 
t U! sisay xis so uou so uo sao Ə|Huə2)əd suugpnis p 9|1Joud upə nok ə,oH 


SNI1384831NI YOJ TYANYW d31S 409 u! pəpn|əu! aso usoj 37] 3ONd Oy) uo 

Spung Ə|Huə1əd Burmosp puo uolipuuoju) Buyp10D01 105 suooosig *Bulpiosoy 
“SHIUOW ç 40 p jo ponad p ujm pasajsiuiupo əq p|nous pəpn|3u! sisaj 

HIP “PHI9A Əq oj spao ot uaamyoq suosupduuoə no 404 JOPIO u) “591105 4315 


a 
3 R 
ameOumZe— ow 


00 oor oo 
2 */Z sələ5uy soq D 
aS -fey ”uoləəulq 93119 Bulisə] |ouorioənp3 Gl UOISIAIG 459} 91 
Z= rose Ibe | poyoruoy P2A39591 Sj ”ZS61 1u6u4do5 (D) 
vz ve wog 14O posp Sunds 
le S0910 P039 oN joe saysiqndi 
Pi š may) zara San SS 
7122722 ° ? utd i era 42] easel Weg LP oog —= 7 aby 
ow 
SS3¥DOUd TWNOLVINGI JO S1S3L 1Vl1N3nO3S 7 212 10 90019 sara Iocsps 


L 31I3O3d 1N3qñn1S d31S = AAD Somar] N 


V 
—_-- 
— 

— a 


Interpreting Test Scores and Norms 293 


comparing performance on different tests. Interpreting differences between 
tests is simple with these score bands. If the bands for two tests overlap, we 
can assume that performance on the two tests does not differ significantly. If 
the bands do not overlap, we can assume that there is probably a real difference 
in performance. 
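The overlap rule can be stated as a few lines of code. The following Python
sketch is an illustration only, not part of the original text. The names
ScoreBand, score_band, and bands_overlap are hypothetical, as are the scores and
standard errors in the example, and bands that merely touch at an end point are
treated here as overlapping, which is an assumption.

```python
from typing import NamedTuple


class ScoreBand(NamedTuple):
    low: float
    high: float


def score_band(obtained: float, sem: float) -> ScoreBand:
    """Band one standard error of measurement on either side of a score."""
    return ScoreBand(obtained - sem, obtained + sem)


def bands_overlap(a: ScoreBand, b: ScoreBand) -> bool:
    """True when two bands share at least one point (touching counts)."""
    return a.low <= b.high and b.low <= a.high


# Hypothetical derived scores and standard errors for one pupil.
reading = score_band(52, 3)   # band 49 to 55
math_ = score_band(56, 3)     # band 53 to 59
science = score_band(62, 2)   # band 60 to 64

print(bands_overlap(reading, math_))    # bands overlap: no significant difference
print(bands_overlap(reading, science))  # no overlap: probably a real difference
```

Both scores must be expressed on the same derived-score scale, and the norms
must be comparable, before such a comparison has any meaning.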
The score bands used with the Sequential Tests of Educational Progress are
obtained from the table of norms. Other provisions for plotting such bands
on test profiles have also been made by test publishers. On the Academic Promise
Tests, it is simply a matter of drawing a short cross line for the percentile
score and then making a band which extends one-half inch above and one-half
inch below that score. An example of a student profile on the APT is shown in
Figure 14.4.

The individual profile chart for the Metropolitan Achievement Tests is
arranged in such a manner that the score bands can be expressed in terms of
stanines. As shown in Figure 14.5 each score is plotted as a band two stanines
wide. Here, plotting is simply a matter of making a dot to represent the stanine
score and then drawing a line one stanine distance long to the right and to the
left of the dot.

[Most] test publishers do not make provisions for plotting score bands, but we
[can expect] more of them to follow this practice in the future. There is no
need to wait for special provisions, however. It is possible to plot these bands
for any test for which we have a standard error of measurement. All we need to
do is determine the error band in raw score points and refer to the norm table
with these figures. For example, if a pupil has earned a raw score of 74 and the
standard error of measurement is 3, his error band in raw score points ranges
from 71 to 77. By locating these two numbers in the norm table we can obtain
the corresponding range in percentiles, standard scores, or in whatever derived
score is being used, and plot the results directly on the profile. The use of
such bands minimizes the tendency of test profiles to present a misleading
picture. Without the bands, we are apt to attribute significance to differences
in test performance which can be accounted for by chance alone.

When using profiles to compare test performance, it is essential that the
norms for all tests be comparable. Many test publishers provide for this by
standardizing a battery of achievement tests and a scholastic aptitude test on
the same population.

Figure 14.4. APT Student Profile, showing use of score bands. (Reproduced by
permission. Copyright © 1959, 1960, 1961, The Psychological Corporation, New
York, N.Y. All rights reserved.)

Figure 14.5. Student profile on Metropolitan Achievement Tests. (From Manual for
Interpreting Metropolitan Achievement Tests, Grades 1-9, by Walter M. Durost.
Copyright 1962 by Harcourt, Brace & World, Inc. All rights reserved.
Reproduction by permission.)


JUDGING THE ADEQUACY OF NORMS

With the tables of norms provided in test manuals, we can easily and quickly
convert raw scores into derived scores. As noted earlier, these derived scores
may be expressed in terms of grade equivalents, age equivalents, percentiles, or
some form of standard score. Regardless of the type of derived score used, the
main purpose is to make possible interpretation of pupils' test performance in
terms of a clearly defined reference group. This should not be just any
reference group, however, but one which provides a meaningful basis for
comparison.

The adequacy of test norms is a basic consideration during test selection and
a factor to be reckoned with during the interpretation of test scores. The
following criteria indicate the qualities most desired in norms.

1. Test norms should be relevant. Test norms are based on various types of 
groups. Some represent a national sample of all pupils at certain grade or age 
levels while others are limited to a given region or state. For special purposes, 
the norms might also be confined to a limited group such as high school pupils 
in independent schools, girls who have completed secretarial training in a com- 
mercial high school, or college freshmen in engineering. The variety of types 
of groups available for comparison makes it necessary to study the nature of 
the norm sample before using any table of norms. We should ask ourselves: 
Are these norms appropriate for the pupils being tested and for the decisions 
to be made with the results?

If we merely want to compare our pupils with a general reference group in 
order to diagnose strengths and weaknesses in different areas, national norms 
may be satisfactory. Here our main concern is with the extent to which our 
pupils are similar to those in the norm population on such characteristics as 
scholastic aptitude, educational experience, and cultural background. The more 
closely our pupils approximate those in the norm group, the greater is our 
certainty that the national norms provide a meaningful basis for comparison. 

Where we are trying to decide such things as which pupils should be placed
[...] preparatory curriculum, [...] engineering, national [...] particular area,
comparing a pupil with his potential competitors is more meaningful than
comparisons with his present grade- or age-mates.

2. Test norms should be representative. Once we have satisfied ourselves that
a set of test norms is based [...] the norms are truly representative of that
group. [...] on a random sample of the population they represent. This is
extremely difficult and expensive, however, so we must usually settle for
something less. As a minimum, we should demand that all significant subgroups
of the population be adequately represented. For national norms, it is desirable
to have a proper proportion of pupils from such subgroups as the following:
boys and girls, geographic regions, rural-urban areas, socio-economic levels,
racial groups, and schools of varying size. The most adequate representation in
these areas is obtained when the norm sample closely approximates the
population distribution reported by the United States Census Bureau.


[...] place great emphasis on the large number of pupils tested, with little
attention to the sampling procedure used. As noted in these comments by a
leading test publisher, size of the norm sample provides an insufficient
criterion for judging the adequacy of norms.⁷

Unfortunately, many alleged general norms reported in test manuals are not
backed even by an honest effort to secure representative samples of
people-in-general. Even tens or hundreds of thousands of cases can fall woefully
short of defining people-in-general. Inspection of test manuals will show (or
would show if information about the norms were given completely) that many such
massed norms are merely collections of all the scores that opportunity has
permitted the author or publisher to gather easily. Lumping together all the
samples secured more by chance than by plan makes for impressively large
numbers; but while seeming to simplify interpretation, the norms may dim or
actually distort the significance of a score.


Whether appraising national or special-group norms, we should always favor 
the carefully chosen, representative sample over the larger, but biased, sample 
based on availability of results. This requires going beyond the titles on the 
norm tables and making a careful study of the procedures used in obtaining the 
norm sample. If this information is unavailable, it is safe to conclude that the 
norms are probably not representative. 
3. Test norms should be up to date. One factor that is commonly neglected
in judging the adequacy of norms is whether they are currently applicable.
With the rapid changes that are taking place in education, we can expect test
norms to become out of date much sooner than they did when the curriculum
was more stable and there was less emphasis on accelerated programs. These
changes can be expected to have the greatest influence on achievement test
norms, but their effect on those of scholastic aptitude tests should not be
overlooked. Note these remarks by Cronbach⁸:

Test norms become obsolete and need to be checked periodically. Research on the
Wechsler intelligence tests, for example, suggests that the scores of adults
are, on the average, [higher than] those of similar age groups a decade ago.
These changes may be attributed to increasing level of education.

It is generally unsafe to use the copyright date on the test manual as an
indication of when the norms were obtained, since this date may be changed
whenever the manual is altered (no matter how slightly). The description of
the procedures used in establishing norms should be consulted for the year in
which the norm groups were tested. Where a test has been revised, it is also
desirable to make certain that the norms are based on the new edition.

⁷ H. G. Seashore and J. H. Ricks, Norms Must be Relevant, Test Service Bulletin
No. 39. New York: The Psychological Corporation, 1950.
⁸ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row,
1960),

4. Test norms should be comparable. It is frequently necessary, or desirable,
to compare directly scores from different tests. This is the case when we make
profile comparisons of test results to diagnose a pupil's strengths and
weaknesses, or compare aptitude and achievement test scores to detect
underachievers. Such comparisons can be made precisely only if the norms for the
different tests are comparable. Our best assurance of comparability is obtained
when all tests have been normed on the same population. This is routinely done
with the tests in an achievement battery, and some test publishers also
administer a scholastic aptitude test to the same norm group. Whenever the
scores from different tests are to be compared directly, the test manuals should
be checked to determine whether the norms are based on the same group, or if
not, whether they have been made comparable by other means.

5. Test norms should be adequately described. It is difficult to determine if
test norms provide a meaningful basis of comparison unless we know something
about the norm group and the norming procedures used. The type of information
we might expect to find in a test manual includes the following: (1) method
of sampling, (2) number and distribution of cases included in the norm sample,
(3) characteristics of the norm group with regard to such factors as age, sex,
race, scholastic aptitude, educational level, socio-economic status, types of
schools represented, and geographic location, (4) extent to which standard
conditions of administration and motivation were maintained during the testing,
and (5) date of the testing, including whether it was done in the fall or
spring. Other things being equal, we should always favor the test for which we
have detailed descriptions of these and other relevant factors. Such information
is needed if we are to judge the appropriateness of test norms for our
particular purposes.⁹


⁹ American Educational Research Association and National Council on Measurements
Used in Education, Technical Recommendations for Achievement Tests (Washington:
National Education Association, 1955).


THE USE OF LOCAL NORMS

For many purposes, local norms are more useful than the norms provided in
test manuals. If our pupils deviate markedly from those in the published norms
on such characteristics as scholastic aptitude, educational experience, or
cultural background, for example, comparison to a local group may be more
meaningful. Local norms are also desirable when we are selecting pupils for a
specific group, such as an accelerated program in mathematics or a retarded
class in reading. Where it is desired to make profile comparisons of scores
obtained from tests standardized on different populations, local norms can also
be used to obtain comparable scores by administering all tests to a common local
group. In general, the more emphasis we give to instructional uses of test
results, the greater will be our need for local norms.

Some test manuals provide detailed directions for preparing local norms.¹⁰
These are most commonly based on percentile ranks because of the ease with
which such norms can be computed. A typical procedure is illustrated in
Figure 14.6. This is a completed worksheet which was used to construct local
norms on one of the Cooperative English Tests. The directions given in the test
manual may be summarized as follows:

1. The scores are grouped by intervals of two (Column 1).
2. A tally is made of the number of students earning scores falling in each
group (Column 2).
3. The tallies are added and placed in the frequency column (Column 3).
4. The frequencies in Column 3 are added from the bottom up (e.g., 1 + 3 = 4,
4 + 7 = 11, and so on) to determine the cumulative frequency (Column 4).
5. The percentile rank (Column 5) for each score group is computed as follows
(using the score interval of 128-129 to illustrate):
(a) Find one-half the frequency of the score group (½ × 10 = 5).
(b) Add the result of (a) to the cumulative frequency for the score group just
below it (5 + 11 = 16).
(c) Divide the result of (b) by the total number of students in the norm group,
and multiply by 100 (16 ÷ 300 = .05; .05 × 100 = 5).
6. The percentile bands (Column 6) are obtained by consulting special tables
in the manual. These bands take into account the error present in the scores
and are interpreted in the same manner as those shown in the profiles presented
earlier.

It should be noted that, with the exception of the percentile bands, this
procedure can be used to construct local percentile norms for any test.

Figure 14.6. Example of how local norms are computed. (From Manual for
Interpreting Scores, Cooperative English Tests, Copyright 1960 by Cooperative
Test Division, Educational Testing Service. Used by permission.)

¹⁰ R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and
Education (New York: John Wiley & Sons, 1961).
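The worksheet procedure described above (grouping, tallying, cumulating from
the bottom, and computing percentile ranks) can be sketched in Python as
follows. This is an illustration under stated assumptions, not the publisher's
method: the function name local_percentile_norms is hypothetical, raw scores
are assumed to be whole numbers, and the sample distribution is invented so
that the 128-129 interval has a frequency of 10, a cumulative frequency of 11
below it, and a norm group of 300, as in the worked example. Percentile bands
are not computed, since they come from special tables in the manual.

```python
from collections import Counter


def local_percentile_norms(scores, width=2):
    """Build a local norms table from whole-number raw scores.

    Returns rows of (low, high, frequency, cumulative_frequency,
    percentile_rank), lowest score interval first.
    """
    if not scores:
        raise ValueError("at least one score is required")
    total = len(scores)
    # Group scores into intervals and tally the frequency of each.
    freq = Counter(s - s % width for s in scores)
    rows = []
    cumulative = 0
    for low in range(min(freq), max(freq) + width, width):
        f = freq.get(low, 0)
        below = cumulative          # cumulative frequency just below
        cumulative += f             # frequencies added from the bottom up
        # Half the interval's frequency plus all cases below it,
        # divided by the size of the norm group, times 100.
        percentile_rank = 100 * (below + f / 2) / total
        rows.append((low, low + width - 1, f, cumulative, percentile_rank))
    return rows


# Invented distribution that matches the figures in the worked example.
sample = [122] * 1 + [124] * 3 + [126] * 7 + [128] * 10 + [140] * 279
for row in local_percentile_norms(sample):
    if row[2]:                      # show occupied intervals only
        print(row)
```

For the 128-129 interval the function gives (5 + 11) / 300 × 100, or about 5.3,
which rounds to the percentile rank of 5 shown in the worked example.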


CAUTIONS IN INTERPRETING TEST SCORES 


Interpreting test scores with the aid of norms requires an understanding of 
the type of derived score used and a willingness to study carefully the charac- 
teristics of the norm group. In addition, however, we need to keep in mind the 
following general cautions which apply to the interpretation of any test score. 

1. A test score should be interpreted in terms of the specific test from which 
it was derived. No two scholastic aptitude tests nor achievement tests measure 
exactly the same thing. Achievement tests are especially prone to wide variation 
and the differences are seldom reflected in the test title. For example, one arith- 
metic test might be limited to simple computational skills while another contains 
a large number of reasoning problems, Similarly, one science test may be con- 
fined largely to items measuring knowledge of terminology while another with 
the same title stresses the application of scientific principles. With such variation 
it is misleading to interpret a pupil's test score as representing general
achievement in any particular area. We need to look beyond test titles and to
evaluate the pupil's performance in terms of what the test actually does
measure.

2. A test score should be interpreted in light of all relevant characteristics
of the pupil. Test performance is influenced by the pupil's aptitudes,
educational experiences, cultural background, emotional adjustment, health, and
the like. Consequently, when a pupil performs poorly on a test, it is desirable
to first consider the possibility of cultural deprivation, a language handicap,
improper motivation, or similar factors which might have interfered with the
pupil's response to the test. If the test is an achievement test, we must, of
course, also
take into account the pupil’s scholastic aptitude. A low ability pupil perform- 
ing two years below his grade level might be progressing at a rate satisfactory 
for him. On the other hand, a bright pupil performing two years beyond his 
grade level might be achieving far short of his potential. 

3. A test score should be interpreted in terms of the type of decision to be
made. The meaningfulness of a test score is determined to a considerable extent
by the use to be made of it. For example, an IQ score of 100 would have
different [...] in high school, or trying to decide whether [a pupil should be]
encouraged to go to college. We shall find test scores much [more useful when
we] stop considering them as high or low "in general," and [begin evaluat]ing
their significance in relation to the specific decision to be made.

4. A test score should be interpreted as a band of scores rather than a specific
[score]. [Every test score] is subject to error and this error must be allowed
for during [test interpretation]. The best means of [doing] this is to consider
a [pupil's test performance] as a band of scores one standard error of
measurement [above and below] his obtained score. For example, if a pupil earns
a score of 56 and the standard error is 3, his test performance should be
interpreted as a band [ranging] from score 53 to score 59. Such bands were
illustrated in the profiles [presented earlier]. Even where they are not
plotted, however, we should [make allowances] for these error bands surrounding
each score. This will prevent us from [making] interpretations which are more
precise than the test results [warrant. Treat]ing small chance differences
between test scores as though they were significant can only lead to erroneous
decisions.

5. A test score should be verified by supplementary evidence. When interpreting
test scores, it is impossible to determine fully the extent to which the
[basic assumptions] of testing have been met (i.e., maximum motivation, equal
[...], and so on), or to which the conditions of testing have been precisely
[controlled] (i.e., administration, scoring, and so on). Consequently, in
addition to the predictable error of measurement, which can be taken [into
account] with standard error bands, a test score may contain an [...] amount of
error due to unmet assumptions or uncontrolled conditions. Our only protection
against such errors is to place little reliance on a single test score. As
Cronbach¹¹ has noted:

The most helpful single principle in all testing is that test scores are merely
data on which to base further study. They must be coordinated with background
facts, and they must be verified by constant comparison with other available
data.

The misinterpretation and misuse of test scores would be substantially reduced
if this simple principle were more widely recognized.

In passing, it is wise to note that this caution should not be restricted to
test scores. It is merely a specific application of the more general rule that
no important decision should ever be based on one limited sample of behavior.
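The band interpretation described above, and the conversion of a raw-score band
to a derived-score band by way of a norm table, amount to simple arithmetic. The
Python sketch below is illustrative only. The function names and the norm table
are hypothetical (the percentile values are invented, not taken from any
published test); only the score of 56 and the standard error of 3 come from the
example in the text.

```python
def raw_score_band(obtained, sem):
    """Band one standard error of measurement below and above a score."""
    return obtained - sem, obtained + sem


def derived_band(obtained, sem, norm_table):
    """Look up both ends of the raw-score band in a norm table."""
    low, high = raw_score_band(obtained, sem)
    return norm_table[low], norm_table[high]


# Hypothetical norm table: raw score -> percentile rank (invented values).
HYPOTHETICAL_NORMS = {53: 38, 54: 42, 55: 46, 56: 50, 57: 54, 58: 58, 59: 62}

print(raw_score_band(56, 3))                    # (53, 59), as in the text
print(derived_band(56, 3, HYPOTHETICAL_NORMS))  # percentile band for plotting
```

With a real test, the dictionary would be replaced by the norm table in the
test manual, and the resulting band would be plotted on the profile in place of
a single point.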


¹¹ L. J. Cronbach, Essentials of Psychological Testing (New York: Harper & Row,
1960).


SUMMARY

Test interpretation is complicated by the fact that the raw scores obtained
lack a true zero point (point where there is "no achievement at all") and equal
units (such as feet, pounds, and minutes). In an attempt to compensate for the
lack of these properties and to make test scores more readily interpretable,
various methods of expressing test scores have been [devised]. The most common
procedure is to convert raw scores into derived scores by means of tables of
norms. These derived scores indicate a pupil's relative position in a clearly
defined reference group. They have the advantage over raw scores of providing
more uniform meaning from one test to another and from one situation to another.

Test norms merely represent the typical performance of pupils in the reference
groups on which the test was standardized and consequently should not be viewed
as desired goals or standards. The most common types of norms are grade norms,
age norms, percentile norms, and standard score norms. Each type has its own
unique characteristics, advantages, and limitations, which must be taken into
account during test interpretation.

Grade norms and age norms describe test performance in terms of the particular
grade or age group in which a pupil's raw score is just average. These norms
are widely used at the elementary school level, largely due to the apparent
ease with which they can be interpreted. Describing test performance in terms
of grade and age equivalents can frequently lead to unsound decisions, however,
because of the inequality of the units and the invalid assumptions on which
they are based.

Percentile norms and standard score norms describe test performance in terms
of the pupil's relative standing in a group in which he is a member or [desires]
to become a member. A percentile rank indicates the percentage of pupils falling
below a particular raw score. Percentile units are unequal, but the scores are
readily understood by persons without special training. A standard score
indicates the number of standard deviation units a raw score falls above or
below the group mean. It has the advantage of providing equal units which can be
treated arithmetically, but persons untrained in statistics find it difficult to
interpret such scores. Some of the more common types of standard scores are
z-scores, T-scores, deviation IQ's, and stanines.

With a normal distribution of scores, we can readily convert back and forth
between standard scores and percentiles. This makes it possible to utilize the
special advantages of each. Standard scores can be used to obtain the benefits
of equal units and we can convert to percentile equivalents when interpreting
test performance to pupils, parents, and others who lack statistical training.

A pupil's performance on several tests which have comparable norms can
be presented in the form of a profile. This makes it possible to identify
readily areas of strength and weakness. Profile interpretation is more apt to be
accurate when standard error bands are plotted on the profile.

The adequacy of test norms can be judged by determining the extent to which
they are (1) relevant, (2) representative, (3) up to date, (4) comparable, and
(5) adequately described. In some instances, it is more appropriate to use local
norms than published norms. Where local norms are desired, percentile norms
can be readily computed.

In addition to a knowledge of derived scores and norms, the proper
interpretation of test scores requires an acute awareness of (1) what the test
measures, (2) characteristics and background of the pupil, (3) type of decision
to be made, (4) amount of error in the score, and (5) extent to which the score
is [in harmony] with other available data. No important educational decision
should be based on test scores alone.




Chapter 15
evaluating learning
and development:
observational techniques


Direct observation provides the only means we have for evaluating some
aspects of learning and development . . . and it provides supplementary
information concerning others. The problem is . . . how to get an objective
record of the most meaningful behavior? This can be greatly facilitated
through the use of such techniques as: (1) anecdotal records, (2) rating
scales, and (3) checklists.


As we have noted in previous chapters, a large number of learning outcomes
can be measured by paper-and-pencil tests. This is especially true of outcomes
in the cognitive domain, such as those pertaining to knowledge, understanding,
and thinking skills. The significance of these areas in all subject-matter fields
has placed paper-and-pencil testing in a prominent and central role in
educational evaluation. This is as it should be, but we must be careful not to
become totally dependent on paper-and-pencil testing. There are a number of
important behavioral changes that require the use of other procedures.

Learning outcomes in skill areas and behavioral changes in personal-social
development are especially difficult to evaluate with the usual paper-and-pencil
test. A list of such outcomes, with representative types of pupil behavior, is
presented in Table 15.1. This list is by no means complete but it is comprehensive
enough to illustrate the great need to supplement paper-and-pencil testing
with other methods of evaluation.

Learning outcomes and aspects of development like those in Table 15.1 can
generally be evaluated by one of the following procedures: (1) observing the
pupil as he performs and describing or judging his behavior (evaluating a
speech), (2) observing and judging the quality of the product resulting from

Evaluating Procedures, Products, and Typical Behavior
Table 15.1

OUTCOMES REQUIRING EVALUATION PROCEDURES BEYOND
THE TYPICAL PAPER-AND-PENCIL TEST

Outcome        Representative Behaviors

Skills         Speaking, writing, listening, oral reading, performing
               laboratory experiments, drawing, playing a musical instrument,
               dancing, gymnastics, work skills, study skills, and social
               skills.

Work           Effectiveness in planning, use of time, use of equipment, use
Habits         of resources; the demonstration of such traits as initiative,
               creativity, persistence, dependability.

Social         Concern for the welfare of others, respect for laws, respect
Attitudes      for the property of others, sensitivity to social issues,
               concern for social institutions, desire to work toward social
               improvement.

Scientific     Open-mindedness, willingness to suspend judgment, sensitivity
Attitudes      to cause-effect relations, an inquiring mind.

Interests      Expressed feelings toward various educational, mechanical,
               aesthetic, scientific, social, recreational, vocational
               activities.

Appreciations  Feeling of satisfaction and enjoyment expressed toward nature,
               music, art, literature, physical skill, outstanding social
               contributions.

Adjustments    Relationship to peers, reaction to praise and criticism,
               reaction to authority, emotional stability, social
               adaptability.


his performance (evaluating handwriting), (3) asking his peers about him
(evaluating social relationships), and (4) questioning him directly (evaluating
expressed interests). Although these observational techniques, peer appraisals,
and self-report methods are more subjective than we would like, and their use
frequently requires more time and effort than the typical testing procedure,
they provide the best means available for evaluating a variety of important
behaviors. Our choice is simple: either we use these techniques in an attempt
to evaluate each learning outcome and aspect of development as directly and
validly as possible, or we neglect those that cannot be measured by
paper-and-pencil tests. From an educational standpoint, the choice seems obvious.

In this chapter, we shall describe those observational techniques found
especially useful by teachers. These include:

Anecdotal Records

Rating Scales

Checklists


The following chapter will be devoted to peer appraisals and self-report
techniques.


ANECDOTAL RECORDS 


Teachers’ daily observations provide them with a wealth of information
concerning the learning and development of their pupils. For example, a
third-grade teacher notices during oral reading that Mary mispronounces several
simple words, that George sits staring out the window, and that Jane keeps
interrupting the reading with irrelevant questions. Similarly, a high school
chemistry teacher notices during a laboratory period that Bill is slow and
inefficient in setting up his equipment, that John finishes his experiments early
and helps others, and that Betty handles the chemicals in a careless and
dangerous manner despite repeated warnings. Such daily incidents and events have
special evaluative significance. They enable us to determine how a pupil
typically performs or behaves in a variety of situations. In some instances, this
information merely supplements and verifies data obtained by more objective
methods. In other cases, it provides the only means we have for evaluating
desired behavioral changes.

Impressions gained through observation are apt to provide an incomplete and
biased picture, however, unless we keep an accurate record of our observations.
A simple and convenient method of doing this is provided by anecdotal records.

Anecdotal records are factual descriptions of the meaningful incidents and
events which the teacher has observed in the lives of his pupils. Each incident
is described shortly after it happens. The descriptions may be recorded on
separate cards like the one shown in Figure 15.1, or running accounts, one for
each pupil, may be kept on separate pages in a notebook. A good anecdotal
record keeps the objective description of an incident separate from any
interpretation of the meaning of the behavior. For some purposes, it is also desirable
to provide an additional space for recommendations concerning ways to improve


Class: 4th Grade    Pupil: Bill Johnson
Date: 4/25/63    Place: Classroom    Observer: M. G.

INCIDENT
As class was about to start, Bill asked if he could read a poem to the class,
one he had written himself, about "spring." He read the poem in a low voice,
constantly looked down at the paper, moved his right foot back and forth, and
pulled on his shirt collar. When he finished, Jack (in the back row) said "I
couldn't hear it. Will you read it again, louder?" Bill said "no" and sat down.

INTERPRETATION
Bill enjoys writing stories and poems and they reflect considerable creative
ability. However, he seems very shy and nervous in performing before a group.
His refusal to read the poem again seemed to be due to his nervousness.

Figure 15.1. Anecdotal record form.




the pupil’s learning or adjustment. Such recommendations are seldom made,
however, until a series of anecdotes have been accumulated.1
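
For readers who keep such records electronically, the card layout described above could be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `AnecdotalRecord`, its field names, and the `as_card` method are hypothetical and are not part of the text. The one feature carried over from the text is that the factual incident, the interpretation, and any recommendation are held in separate fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnecdotalRecord:
    """One observed incident; description is kept apart from interpretation."""
    pupil: str
    class_name: str
    date: str
    place: str
    observer: str
    incident: str                          # factual description only
    interpretation: Optional[str] = None   # tentative meaning, kept separate
    recommendation: Optional[str] = None   # usually empty until several anecdotes exist

    def as_card(self) -> str:
        """Render the record in the layout of a simple card."""
        lines = [
            f"Class: {self.class_name}    Pupil: {self.pupil}",
            f"Date: {self.date}    Place: {self.place}    Observer: {self.observer}",
            "INCIDENT",
            self.incident,
        ]
        if self.interpretation:
            lines += ["INTERPRETATION", self.interpretation]
        if self.recommendation:
            lines += ["RECOMMENDATION", self.recommendation]
        return "\n".join(lines)


record = AnecdotalRecord(
    pupil="Bill Johnson", class_name="4th Grade", date="4/25/63",
    place="Classroom", observer="M. G.",
    incident='Bill read his poem in a low voice. Asked to read it again, he said "no" and sat down.',
    interpretation="His refusal to read the poem again seemed to be due to nervousness.",
)
print(record.as_card())
```

Leaving `recommendation` empty by default mirrors the point that recommendations are seldom made until a series of anecdotes is available.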


Uses of Anecdotal Records 


The use of anecdotal records has frequently been limited to the area of social 
adjustment. While they are especially appropriate for this type of reporting, 
this is a needless limitation. Anecdotal records can be used for obtaining data 
pertinent to a variety of learning outcomes and to many aspects of personal and 
social development. The potential usefulness of the anecdotal method can be 
revealed by reviewing the various areas of learning outcomes presented earlier 
in this chapter; you will note that many of the behaviors listed there can be 
appraised by means of direct observation. 

The problem in using anecdotal records is not so much what can be evaluated, 
but rather what should be evaluated, with this method. It is obvious that we 
cannot observe and report on all aspects of pupil behavior, no matter how 
useful such records might be. Thus, the time-consuming nature of the task 
requires that we be selective in our observations. 


Deciding What Behaviors to Observe and Record 


In general, our objectives and desired outcomes will guide us in determining 
what behaviors are most worth noting. In addition, we must also be alert to 
those unusual and exceptional incidents which contribute to a better under- 
standing of each pupil’s unique pattern of behavior. Within this general frame- 
work, there are several steps we can take to limit and control our observations 
so that a realistic system of recording can be developed. They are: 


1. Confining our observations to those areas of behavior that cannot be 
evaluated by other means. 


2. Limiting our observations of all pupils at any given time to just a few 
types of behavior. 


3. Restricting the use of extensive observations of behavior to those few 
pupils who are most in need of special help. 


There is no advantage in using anecdotal records to obtain evidence of
learning in areas where more objective and usable methods are available.
Knowledge, understanding, and various aspects of thinking skill can usually be
evaluated by paper-and-pencil tests. Many learning outcomes of other types, such as
the ability to give a speech, operate a microscope, or write a theme, are more
effectively evaluated by rating methods or by product evaluation. Records of
actual behavior are best used to evaluate how a pupil typically behaves in a
natural setting. How does he approach a problem? How persistent is he in
carrying out a task? How willing is he to listen to the ideas of others? What
activities seem to attract his interest? What contributions does he make to class
activities? How effectively does he work with others? How does he respond to
praise and criticism? Noting a pupil’s verbal comments and actions in various
situations provides certain clues to his attitudes, interests, appreciations,
and adjustment patterns that cannot be obtained by any other means.
These are the types of behavior toward which we should focus our attention
when keeping anecdotal records.

1 A. E. Traxler, Techniques of Guidance (New York: Harper & Row, 1957).

The best we can hope for with anecdotal records is a fairly representative
sample of behavior in the different areas in which we desire information.
This sample can be obtained more easily if we concentrate our observations on
a few types of behavior at a time. For example, an elementary teacher might pay
particular attention to reading interests during the free reading period, to
signs of appreciation during music and art, and to patterns of social relations
during recess. Similarly, a high school science teacher might concentrate on
incidents reflecting scientific attitudes during certain class discussions and
laboratory periods, and on [...] and laboratory skills during others. In some
cases the activity itself will determine the types of observation most fruitful
to focus on, while in others the emphasis at any given time may need to be
determined in an arbitrary manner. Despite the concentration of attention on
certain areas at a particular time, however, we should always be alert to other
incidents and events which have special significance for understanding the
pupil’s learning and development.

In addition to recording some information on all pupils, there are times when
we need more comprehensive information regarding a relative few. The severely
[...], the socially rejected child, and the gifted underachiever are
typical of those needing special attention. More extensive observations of such
pupils are helpful in understanding their difficulties and in providing clues for
remedial action. The most complete and useful information is obtained when
we concentrate our observations on one or two pupils at a time. During such
observations it may also be necessary to restrict further our record keeping on
other pupils.

Teachers frequently become discouraged when they first use anecdotal records
because they attempt to do too much. Limiting observations and reports to
specific types of behavior, to specific pupils, or both, is frequently necessary
to make the procedure feasible. It is much better to have a clearly delimited
and workable observational plan than to end up with an incomplete and atypical
collection of unrelated incidents.


Advantages and Limitations of Anecdotal Records


Probably the most important advantage of anecdotal records is that they
provide a description of actual behavior in natural situations. The old adage
that "actions speak louder than words" has direct application here. A pupil
may show good knowledge of health practices but not use a handkerchief when
he sneezes or coughs; he may profess great interest in science but approach his
laboratory work in a haphazard and disinterested fashion; or he may express
great concern for the welfare of others but behave in a selfish manner. Records
of actual behavior provide a check on other evaluation methods and also enable
us to determine the extent of change in the pupil’s typical patterns of behavior.

In addition to compiling descriptions of the most characteristic behavior of
a pupil, anecdotal records make possible gathering evidence on events that are
exceptional but significant. Typical examples are the quiet pupil who speaks
in class for the first time, the hostile pupil who makes a friendly gesture, the
extreme conformist who shows a sign of originality, and the apathetic pupil who
shows a spark of interest. These individually significant behaviors are apt to
be excluded by other evaluation techniques. They are also likely to be
overlooked by teachers unless a concerted effort is made to observe such incidents.


Keeping anecdotal records makes us more diligent in observation and
increases [...].

[...] the elementary teacher is that anecdotal records can [...] pupils and with
others who are retarded in [...]. Since young children tend to be more
spontaneous and uninhibited in their actions, their behavior is easier to
observe and [...].

The major limitation of anecdotal records is the amount of time required in
maintaining an adequate system of records. Though this can be offset somewhat
by limiting observations and reports as suggested earlier, it is still a
time-consuming task. If a teacher keeps anecdotal records for his own use only, he
[...] plan by starting with a few anecdotes each day and [...] it is desirable
to have all teachers [...] as many [...] for a period of a few weeks, and then
[...] meeting to discuss the recorded anecdotes and to decide what constitutes a
reasonable number. It is generally unwise to set a specific number that must
be recorded each week, but an approximate minimum can serve as a useful
general guide. When anecdotal records [...] the most time-consuming aspect is
[...] of the summaries in the pupil’s cumulative records. Of course, much [...]
work can be handled by the clerical staff.

Another serious limitation of anecdotal records is the difficulty of being
objective when observing and reporting pupil behavior. Ideally, we would like
a series of verbal "snapshots" which accurately represent the pupil’s actual
behavior. This is seldom attained, however, for the teacher’s own biases
and preconceived notions enter into his observations and reports. For example,
he will tend to notice more desirable qualities in those pupils he likes best and
more undesirable qualities in those he likes least. If he is evaluating the
effectiveness of a new teaching technique in which he has great faith, he will
tend to note positive results and to ignore the negative. If he believes
that boys are less well coordinated than girls, he will tend to perceive their
performance skills as being of lower quality. Training in observation and
reporting can reduce such distortions to a minimum, but they cannot be
eradicated entirely. When anecdotal records are accumulated from a number
of teachers, however, the biases of any particular teacher become less influential
in the total pattern.




A related difficulty is that of obtaining an adequate sample of behavior. When 
a pupil is participating in class discussion, he may be so tense and anxious 
that he appears cold and unfriendly toward others and his ideas seem dis- 
organized. When observed in less formal settings, such as in the laboratory or 
on the playground, his behavior might be quite different. Similarly, a pupil may 
appear highly motivated and interested in mathematics class but bored and 
disinterested during English literature, or he may be attentive and inquisitive 
in science one day and apathetic the next. Everyone’s behavior fluctuates somewhat
from situation to situation and from one time to another. Therefore, to
obtain a reliable picture of a pupil’s typical pattern of behavior we need to 
observe him over a period of time and in a variety of situations. This also 
implies that general interpretations and recommendations concerning a pupil’s 
adjustment will be delayed until a fairly adequate sample of behavior is obtained. 


Improving the Effectiveness of Anecdotal Records 


In the previous sections, we have stated or implied a number of ways to
improve procedures for observing and reporting pupil behavior. These and other
points are listed below in a series of suggestions for the effective use of
anecdotal records.

1. Determine in advance what to observe, but be alert for unusual behavior.
We are more apt to select and record meaningful incidents if we review
objectives and outcomes and decide which behaviors require evaluation by direct
observation, that is, those that cannot be effectively evaluated by other means.
We can further focus observations by looking for just a few specific types of
behavior at any given time. While such directed observations are highly
desirable for obtaining evidence of pupil learning, there is always the danger that
unique incidents which have special value for understanding a pupil’s
development will be overlooked. Consequently, we must be sufficiently flexible to note
and report any unusual behavior in the event that it may be significant.

2. Observe and record enough of the situation to make the behavior
meaningful. It is difficult to interpret behavior apart from the situation in which it
occurred. An aggressive action, such as pushing another child, for example,
might reflect good-natured fun, an attempt to get attention, a response to direct
provocation, or a sign of extreme hostility. Clues to the meaning of behavior
frequently can be obtained by directing attention to the actions of the other
pupils involved and to the particular setting in which the behavior took place.
The record, therefore, should contain those situational conditions which seem
necessary for understanding the pupil’s behavior.

3. Make a record of the incident as soon after the observation as possible.
In most cases it is infeasible to write a description of an incident at the time
it happens. However, the longer we delay in recording observations, the greater
the likelihood that important details will be forgotten. Making a few brief notes
at opportune times following behavioral incidents and completing the records
after school generally provides a feasible and satisfactory procedure.

4. Limit each anecdote to a brief description of a single specific incident.
Brief and concise descriptions take less time to write, less time to read, and are
more easily summarized. Just enough detail should be included to make the
description meaningful and accurate. Limiting each description to a single
specific incident also simplifies the task of writing, using, and interpreting the
records.

5. Keep the factual description of the incident and your interpretation of it
separate. The description of an incident should be as accurate and objective as
you can make it. This means stating exactly what happened in clear and
nonjudgmental words. Avoid such terms as lazy, unhappy, shy, hostile, sad,
ambitious, persistent, and the like. If used at all, reserve such words for the separate
section in which you give your tentative interpretations of the incident. There
is no need to interpret each incident, but when interpretations are given they
should be kept separate and clearly labelled as such.
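
As a rough aid for records kept in electronic form, the word list in this suggestion could be turned into a simple check on the factual description. The sketch below is an assumption-laden illustration, not a procedure from the text: the function name `judgmental_terms_in` is hypothetical, and the check only matches the exact terms named above, so it cannot judge whether a description is truly objective.

```python
import re
from typing import List

# Terms the suggestion above advises keeping out of the factual description.
JUDGMENTAL_TERMS = {"lazy", "unhappy", "shy", "hostile", "sad", "ambitious", "persistent"}


def judgmental_terms_in(description: str) -> List[str]:
    """Return the listed judgmental terms found in a description, sorted."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return sorted(words & JUDGMENTAL_TERMS)


factual = 'Bill said "no" and sat down.'
judging = "Bill was shy and unhappy when asked to read."
print(judgmental_terms_in(factual))   # []
print(judgmental_terms_in(judging))   # ['shy', 'unhappy']
```

A flagged word is a prompt to move the remark into the interpretation section, not a verdict that the record is poor.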

6. Record both positive and negative behavioral incidents. There is a general
tendency for teachers to note more readily those behaviors which disturb them
personally and which interfere with the on-going process in the classroom.
The result is that anecdotal records frequently contain a disproportionate
number of incidents which indicate the lack of learning or development. For
evaluation purposes, it is equally important to record the less dramatic incidents
which provide clues concerning the growth that is taking place. Thus, a
conscious effort should be made to observe and record these more subtle positive
[...]

7. [...] settings that his basic pattern of behavior begins to emerge.
Consequently, we should generally delay making any judgments concerning his
learning or development until we have a sufficient sample of behavior to provide
a reliable [...].

8. Obtain practice in writing anecdotal records. At first, most teachers have 
considerable difficulty in selecting significant incidents, in observing them 
accurately, and in describing them objectively. Some training and practice is 
therefore desirable before embarking on the use of anecdotal records. If the
entire school staff is involved, a regular inservice training program should be 
provided. Where an individual teacher wants to explore their use in his own 
classroom, the aid of a supervisor or fellow teacher can be helpful in appraising 
the quality of the records. 


RATING SCALES 


In contrast to the unstructured descriptions of behavior obtained with
anecdotal records, rating scales provide a systematic procedure for obtaining and
reporting the judgments of observers. Typically, a rating scale consists of a set
of characteristics or qualities to be judged and some type of scale for indicating
the degree to which each attribute is present. The rating form itself is merely
a [...]. Its value in appraising the learning and development of
pupils depends upon the care with which it is prepared and the
appropriateness with which it is used. As with other evaluation instruments, it should
be constructed in accordance with the learning outcomes to be evaluated and
its use should be confined to those areas where there is a sufficient opportunity
to make the necessary observations. If these two principles are properly applied,
a rating scale serves several important evaluative functions: (1) it directs
observation toward specific and clearly defined aspects of behavior, (2) it
provides a common frame of reference for comparing all pupils on the same set of
characteristics, and (3) it provides a convenient method for recording judgments
of the observers.


Types of Rating Scales


Rating scales may take many specific forms, but the majority of them can
be classified as belonging to one of the types described below. Each type will
be illustrated by using two dimensions from a scale for rating "contributions
to class discussion."

Numerical Rating Scale. One of the simplest types of rating scales is that
where the rater checks or circles a number to indicate the degree to which a
characteristic is present. Typically, each of a series of numbers is given a verbal
description which remains constant from one characteristic to another. In some
cases, the rater is merely told that the largest number is high, 1 is low, and the
other numbers represent intermediate values.


EXAMPLE

Directions: Indicate the degree to which this pupil contributes to class
discussions by encircling the appropriate number. The numbers represent the
following values: 5 = outstanding, 4 = above average, 3 = average,
2 = below average, and 1 = unsatisfactory.

1. To what extent does the pupil participate in discussions?

   1    2    3    4    5

2. To what extent are the comments related to the topic under discussion?

   1    2    3    4    5
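
The numerical scale in this example can be pictured as a fixed mapping from numbers to verbal descriptions that is applied unchanged to every item. The following is a minimal sketch under that reading; the names `SCALE` and `describe_ratings` and the item labels are hypothetical.

```python
from typing import Dict

# Verbal descriptions from the example's directions; the same for every item.
SCALE: Dict[int, str] = {
    5: "outstanding",
    4: "above average",
    3: "average",
    2: "below average",
    1: "unsatisfactory",
}


def describe_ratings(ratings: Dict[str, int]) -> Dict[str, str]:
    """Translate each circled number into its verbal description."""
    for item, value in ratings.items():
        if value not in SCALE:
            raise ValueError(f"{item}: {value} is not a point on the scale")
    return {item: SCALE[value] for item, value in ratings.items()}


ratings = {"participates in discussions": 4, "comments related to topic": 2}
print(describe_ratings(ratings))
```

Rejecting values that are not on the scale reflects the fact that a numerical scale, unlike a graphic one, offers no intermediate points.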


The numerical rating scale is useful when the characteristics or qualities to
be rated can be classified into a limited number of categories and when there
is a general agreement concerning the category represented by each number. As
commonly used, however, the numbers are only vaguely defined, so that
considerable variation in the interpretation and use of the scale occurs.

Graphic Rating Scale. The distinguishing feature of the graphic rating
scale is that each characteristic is followed by a horizontal line. The rating is
made by placing a check on the line. Typically a set of categories identifies
specific positions along the line, but the rater is free to check between these
points if he desires.




EXAMPLE

Directions: Indicate the degree to which this pupil contributes to class
discussions by placing an x anywhere along the horizontal line under each item.

1. To what extent does the pupil participate in discussion?

   |-------------|-------------|-------------|-------------|
   never       seldom     occasionally   frequently     always

2. To what extent are the comments related to the topic under discussion?

   |-------------|-------------|-------------|-------------|
   never       seldom     occasionally   frequently     always


The scale shown in this example uses the same set of categories for each 
characteristic and is commonly referred to as a constant-alternatives scale. Where 
these categories vary from one characteristic to another the scale is called, quite 
logically, a changing-alternatives scale. 

Although the line in the graphic rating scale makes it possible to rate at inter- 
mediate points, the use of single words to identify the categories has no great 
advantage over the use of numbers. There is little agreement among raters 
concerning the meaning of such terms as seldom, occasionally, and frequently. 
What is needed are behavior descriptions which indicate more specifically what 
pupils are like who possess various degrees of the characteristic being rated. 

Descriptive Graphic Rating Scale. This rating form uses descriptive 
phrases to identify the points on a graphic scale. The descriptions are thumb- 
nail sketches which convey in behavioral terms what pupils are like at different 
steps along the scale. In some scales, only the center and end positions are 
described. In others, a descriptive phrase is placed beneath each designated 
point. A space for comments is also frequently provided, to enable the rater to
clarify his rating or to record behavioral incidents pertinent to the characteristics
being rated.


EXAMPLE

Directions: Make your ratings on each of the following characteristics by
placing an x anywhere along the horizontal line, under each item. In the space
for comments, include anything that helps clarify your rating.

1. To what extent does the pupil participate in discussions?

   |------------------------|------------------------|
   Never participates;      Participates as          Participates more
   quiet, passive           much as other            than any other
                            group members            group member

   COMMENT:

2. To what extent are the comments related to the topic under discussion?

   |------------------------|------------------------|
   Comments ramble,         Comments usually         Comments are
   distract from            pertinent, occasionally  always related to
   topic                    wanders from topic       topic

   COMMENT:


The descriptive graphic rating scale is generally the most satisfactory for
school use. It clarifies to both the teacher and the pupil the types of behavior
that represent different degrees of progress toward desired learning outcomes.
The more specific behavior descriptions also contribute to greater objectivity
and accuracy during the rating process.

Ranking Methods. Some rating procedures do not require a printed scale.
Probably the most applicable and best known of these is the simple rank-order
method. With this approach, the pupils (or products) being rated are merely
ranked in the order in which the rater estimates they possess the characteristic
being judged. Typically, the rater will rank from both ends toward the middle.
For example, a teacher may rank all of his pupils in order of their
"participation in class discussion," by indicating the one who is highest in participation,
then the one who is lowest, then the one who is next highest, next lowest, and
so on, until a complete ranking is obtained. Ranking from the ends toward
the middle simplifies the procedure and increases the likelihood that the pupils
will be properly ranked. Even with this refinement, however, ranking is a
cumbersome procedure if the number to be ranked is large.
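
The bookkeeping behind ranking from both ends toward the middle can be sketched as follows. The judgments themselves remain the rater's; the code only assembles the final order from the sequence of choices. The function name `rank_from_ends` and the pupil names are hypothetical.

```python
from typing import List


def rank_from_ends(picks: List[str]) -> List[str]:
    """Assemble a full ranking (highest first) from alternating picks.

    `picks` lists pupils in the order the rater chose them:
    highest, lowest, next highest, next lowest, and so on.
    """
    top: List[str] = []      # filled from rank 1 downward
    bottom: List[str] = []   # filled from the last rank upward
    for i, pupil in enumerate(picks):
        if i % 2 == 0:
            top.append(pupil)
        else:
            bottom.append(pupil)
    return top + bottom[::-1]


picks = ["Ann", "Ed", "Bob", "Dee", "Cal"]
print(rank_from_ends(picks))  # ['Ann', 'Bob', 'Cal', 'Dee', 'Ed']
```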

A practical modification of the usual ranking method requires the rater to
sort the pupils (or products) into a given number of groups. This is essentially
the procedure followed when essay questions, English themes, and various types
of class projects are divided into groups, on the basis of over-all quality, for
the purpose of assigning letter grades (i.e., A, B, C, D, and E). This division
into groups is also frequently used as a first step in complete ranking. The
pupils (or products) are sorted into a series of graded groups and then ranked
within each group.

A more precise, though time-consuming, procedure for obtaining a ranking
of pupils (or products) is by means of the paired-comparison method. With
this approach, each pupil is paired with every other pupil and the rater
indicates which one of each pair is superior in the characteristic being rated. A
simple tally of the number of times each pupil is checked as superior provides
the basis for ranking. Since the rater is required to judge whether each pupil
is better or worse than each of the other pupils in the group, the results tend
to be more reliable than those obtained by the usual ranking procedure. The
number of comparisons required, however, severely curtails the use of this
procedure. It is most useful where there is a relatively small number of pupils (or
products) to be rated, and in those research situations where reliability is given
a much higher priority than the practicality of the procedures.
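
The tally described in this paragraph, and the reason the number of comparisons limits its use, can be sketched as below. This is an illustration only: the function names are hypothetical, and the rater's judgments are simulated with made-up quality scores, which a real rater would not have. With n pupils the number of pairs is n(n - 1)/2.

```python
from itertools import combinations
from typing import Callable, Dict, List


def paired_comparison_ranking(pupils: List[str],
                              is_superior: Callable[[str, str], bool]) -> List[str]:
    """Rank pupils by how often each is judged superior across all pairs."""
    wins: Dict[str, int] = {p: 0 for p in pupils}
    for a, b in combinations(pupils, 2):      # every pupil paired with every other
        winner = a if is_superior(a, b) else b
        wins[winner] += 1
    # sorted() is stable, so tied pupils keep their original order
    return sorted(pupils, key=lambda p: wins[p], reverse=True)


def comparisons_needed(n: int) -> int:
    """Number of pairs among n pupils: n(n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical judgments, simulated here with made-up quality scores.
quality = {"Ann": 3, "Bob": 1, "Cal": 4, "Dee": 2}
ranking = paired_comparison_ranking(list(quality), lambda a, b: quality[a] > quality[b])
print(ranking)                 # ['Cal', 'Ann', 'Dee', 'Bob']
print(comparisons_needed(30))  # 435
```

A class of thirty already calls for 435 separate judgments on a single characteristic, which is why the method suits small groups.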

In comparison to rating scales, ranking methods have one important
advantage. They force the rater to differentiate among the pupils being rated. He
cannot rate them all high or all average as is possible on a rating scale. With
ranking, the pupils (or products) must be placed in relative order from high
to low. The two major limitations of ranking procedures are: (1) they do not
provide behavioral descriptions of the pupils, and (2) the meaning of a rank
depends on the size and nature of the group. Ranking tenth in a group of ten
is quite different from ranking tenth in a group of forty. Also, a pupil who ranks
tenth in a gifted group might be superior to a pupil who ranks first in a regular
classroom group.


318 Evaluating Procedures, Products, and Typical Behavior 


Uses of Rating Scales 


Rating scales can be used in the evaluation of a wide variety of learning 
outcomes and aspects of development. As a matter of convenience these uses may 
be classified into three major evaluation areas: (1) procedure, (2) product, 
and (3) personal-social development. 

Procedure Evaluation. In many areas, achievement is expressed directly 
in the performance of the pupil. Typical examples include the ability to give a 
speech, manipulate laboratory equipment, work effectively in a group, sing, 
play a musical instrument, and perform various physical feats. Such activities 
do not result in a product that can be evaluated, and paper-and-pencil tests are 
generally inadequate. Consequently, the procedures used in the performance 
itself must be observed and judged. 

Rating scales are especially useful in evaluating procedures because they 
direct our attention to the same aspects of performance in all pupils and they 
provide a common scale on which to record our judgments. If the rating form 
has been carefully prepared in terms of specific learning outcomes, it also serves 
as an excellent teaching device. The dimensions and behavior descriptions used 
in the scale make clear to the pupil the type of performance desired. 

Two items from a typical rating scale for evaluating a speech are presented 
in Figure 15.2. The first part of the form is devoted to the content of the speech 
and how well it is organized. The second part is concerned with aspects of 
delivery such as gestures, posture, appearance, eye contact, voice, and enun- 
ciation. In developing such a scale, a teacher must, of course, include those 
characteristics which are most appropriate for the type of speaking ability to 
be evaluated and for the age level at which the evaluation is to be made. 

Product Evaluation. Where pupil performance results in some type of
product, it is frequently more desirable to judge the product rather than the
procedures. The ability to write a theme, for example, is best evaluated by
judging the quality of the theme itself. Little is to be learned by observing the
pupil as he writes the theme. In some areas, such as typing, cooking, and
woodworking, it might be most desirable to rate procedures during the early phases
of learning and products later, after the basic skills have been mastered. In any
event, product rating provides desirable evaluative information in many areas.
In addition to those already mentioned, it is useful in evaluating such things
as handwriting, drawings, maps, graphs, notebooks, term papers, book reports,
and various types of objects made in vocational courses.

A rating scale serves somewhat the same purposes in product evaluation that
it does in procedure evaluation. It helps us to judge the products of all pupils in
terms of the same characteristics and it emphasizes to the pupils those qualities
desired in a superior product.


In some instances, it is necessary or desirable to judge a product in terms of
its over-all quality rather than its separate features. Where this is the case, the
products may be simply placed in rank order, or they may be compared to a


Observational Techniques 319 


SPEECH RATING SCALE 


Directions: Rate the pupil's speaking ability by placing an x anywhere 
along the horizontal line, under each characteristic. In 
the space for comments, include anything that helps clarify 
your rating or further describes the pupil's speech behavior. 


A. Content and Organization 


1. Opening remarks 


Inappropriate: Commonplace. No Arouse interest. 
Distract from particular contri- Direct attention 
speech topic. bution to the speech. to speech topic. 


COMMENT: 


B. Delivery 


2. Gestures 


Movements are Generally effective. Natural, expressive 
monotonous or Some distracting movements which 
distracting. mannerisms. emphasize speech. 


COMMENT: 


Figure 15.2. Sample items from speech rating scale. 


product scale. A product scale is a series of samples of the product which have
been carefully graded to represent different degrees of quality. An example of
such a scale, for evaluating handwriting, is presented in Figure 15.3. The scale
is used by moving a sample of the pupil’s handwriting along the scale until the
quality of the writing matches. The pupil’s handwriting is then assigned the
value indicated on the scale.

Product scales can be used in judging the quality of any product, but in 
most areas teachers will need to develop their own scales. This can be readily 
done by selecting samples of pupil work which represent from five to seven 
levels of quality and arranging them in order of merit. The levels can then be 
assigned a value from one to seven and each of the remaining pupil products 
can be compared to the scale and rated in terms of the quality level it matches 
most closely. Such a scale might be developed each time a set of products is to
be evaluated, or a more permanent scale might be developed and made available
for pupil guidance. The latter procedure is to be favored where the product
is fairly complex and difficult to construct.

[Figure 15.3: handwriting scale pairing graded handwriting samples with grade-placement and age-equivalent values; the handwriting samples are not reproduced.]

Figure 15.3. Handwriting scale used in the California Achievement Tests. (Copyright
1957 by California Test Bureau. Used by permission.)

Evaluating Personal-Social Development. One of the most common uses 
of rating scales in the schools is in the rating of various aspects of personal- 
social development. Most report cards have a special place for rating the pupils 
on such attributes as citizenship, interest, effort, classroom conduct, and coop- 
eration. In addition, teachers are frequently required to rate each pupil on a 
standard rating form at periodic intervals. Typically, these ratings are on such 
traits as leadership, initiative, responsibility, honesty, ability to get along with 
others, and emotional stability. 

The rating of personal-social characteristics represents quite a different
process from that used in procedure and product evaluation. When judging
procedures and products, the ratings are usually made during or immediately
after a period of directed observation. In contrast, ratings in the area of
personal-social development are typically obtained at periodic intervals and
represent a kind of summing up of the general impressions a teacher has formed
about his pupils. The ratings are based on observation, to be sure, but the
observations tend to be casual and spread over an extended period of time. We
can generally expect such ratings to reflect more of the teacher’s feelings and
personal biases than those obtained at the end of a period of planned and
directed observation.

Despite the greater subjectivity of the ratings, tapping a teacher’s general
impressions of his pupils provides useful evaluative information. It has been
shown that, when obtained under proper conditions, such impressions can be
reliably reported and are related to various criteria of adjustment.2 Also, the
type of impression a person makes upon others in formal and informal situations
is, in itself, an important dimension of personal-social development.

Except for a few older but carefully developed rating instruments like the
Haggerty-Olson-Wickman Behavior Rating Schedules (see Figure 15.4), there
is a dearth of published scales in this area. Consequently, such rating forms
must usually be developed locally. While this has its disadvantages, local
instruments can frequently be built to fit more closely the particular objectives of
the school.


Common Errors in Rating 


Certain types of errors occur so frequently and persistently in ratings that
special efforts are needed to counteract their influence.3 These include errors
due to (1) personal bias, (2) halo effect, and (3) logical error.

Personal bias errors are indicated by a general tendency to rate all individuals 
at approximately the same position on the scale. Some raters tend to use the 
high end of the scale only. This is probably the most common type of bias and 

2 W. C. Olson, Child Development (Boston: D. C. Heath, 1959). 


3 J. W. Wrightstone, “Observational Techniques,” Encyclopedia of Educational Research
(3rd edition, New York: Macmillan, 1960).


26. Is he easily discouraged or is he persistent?

    Melts before       Gives up before    Gives              Persists until     Never
    slight obstacles   adequate           everything         convinced of       gives in,
    or objections      trial              a fair trial       mistake            Obstinate
    (5)                (3)                (1)                (2)                (4)

27. Is he generally depressed or cheerful?

    Dejected,          Generally          Usually in         Cheerful,          Hilarious
    Melancholic,       dispirited         good humor         Animated,
    In the dumps                                             Chirping
    (3)                (4)                (1)                (2)                (5)

28. Is he sympathetic?

    Inimical,          Unsympathetic,     Ordinarily         Sympathetic,       Very
    Aggravating,       Disobliging,       friendly and       Warm-hearted       affectionate
    Cruel              Cold               cordial
    (5)                (4)                (2)                (1)                (3)

Figure 15.4. Sample items from Haggerty-Olson-Wickman Behavior Rating Schedules.
(Copyright 1930, Copyright renewed 1958, by Harcourt, Brace & World, Inc. Reproduced
by permission.)


is referred to as the generosity error. Occurring much less frequently, but with
persistence for some raters, is the severity error, whereby the lower end of the
scale is favored. Still a third type of constant response is shown by the rater
who avoids both extremes of the scale and tends to rate everyone average. This
is called the central tendency error. It also occurs much less frequently than
the generosity error but it tends to be a fixed response style for some raters.

The tendency of a rater to favor a certain position on the scale has two
undesirable results. First, it makes a single rating of an individual of dubious value.
A high or low rating might reflect the personal outlook of the rater rather than
the personal characteristics of the person rated. This is not quite so serious in
a school setting as it might be elsewhere, however. In a local school situation, we
are apt to know the rating habits of individual teachers and are thus able to
discount their tendencies to overrate or underrate. Second, favoring a certain
position on the scale limits the range of any given individual’s ratings.
Therefore, even if we make allowances for a teacher’s general tendency to rate pupils
high, the ratings for different pupils may be so close together that they fail to
provide reliable discriminations.

The halo effect is an error that occurs when a rater’s general impression of
a person influences how he rates him on individual characteristics. If the rater
has a favorable attitude toward the person being rated he will tend to rate him
high on all traits; if his attitude is unfavorable he will tend to rate him low.
This differs from the generosity and severity errors where the rater tends to
rate everyone high or everyone low.

Since the halo effect causes a pupil to receive similar ratings on all
characteristics, it tends to obscure his strengths and weaknesses on different traits.
This obviously limits the value of the ratings, although the general
impression the pupil has created might be a valid indication of how he influences
others.

A logical error results when two characteristics are rated as more alike, or
less alike, than they actually are because of the rater’s beliefs concerning their
relationship. In rating intelligence, for example, teachers tend to overrate the
intelligence of pupils with high achievement because they expect the
two characteristics to go together. Similarly, teachers who hold the common,
but false, belief that gifted pupils have poor social adjustment will tend to
underrate them on social characteristics. The errors, here, do not result from
biases toward certain pupils or certain positions on the rating scale, but rather
from the rater’s preconceived notions concerning human nature. He assumes a
higher or lower relationship among traits than actually exists and rates
accordingly.

The various types of errors which appear in ratings are rather disconcerting
to the classroom teacher who must depend on rating scales for evaluating
certain aspects of learning and development. Fortunately, however, the errors can
be markedly reduced by proper design and proper use.
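
As a rough illustration of how a set of completed ratings might be screened for these errors, the sketch below flags a rater whose ratings cluster at one position on the scale and computes a simple index of possible halo effect. Everything in it is an assumption made for the example rather than a procedure given in the text: the 5-point scale, the function names, and the `margin` and `min_spread` cutoffs are arbitrary choices, and a flag is only a prompt to look more closely, not proof of bias.

```python
from statistics import mean, pstdev


def describe_rater(ratings, scale_min=1, scale_max=5, margin=0.75, min_spread=0.5):
    """Flag a possible constant error in one rater's ratings of many pupils.

    The cutoffs `margin` and `min_spread` are illustrative assumptions.
    """
    avg = mean(ratings)
    spread = pstdev(ratings)
    midpoint = (scale_min + scale_max) / 2
    if spread >= min_spread:
        return "no constant error flagged"   # ratings are spread over the scale
    if avg >= scale_max - margin:
        return "generosity"                  # clustered at the high end
    if avg <= scale_min + margin:
        return "severity"                    # clustered at the low end
    if abs(avg - midpoint) <= margin:
        return "central tendency"            # clustered at the middle
    return "narrow range"


def halo_index(trait_ratings_by_pupil):
    """Average spread of each pupil's ratings across traits.

    Values near zero mean each pupil received nearly the same rating on
    every trait, which may (or may not) reflect a halo effect.
    """
    return mean(
        pstdev(list(traits.values())) for traits in trait_ratings_by_pupil.values()
    )


print(describe_rater([5, 5, 4, 5, 5]))
print(halo_index({"Ann": {"effort": 5, "conduct": 5}, "Bob": {"effort": 2, "conduct": 2}}))
```

A low halo index can also arise when pupils really are uniform across traits, so the figure is best read alongside ratings of the same pupils by other raters.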


Principles of Effective Rating 


The improvement of ratings requires careful attention to selection of the
characteristics to be rated, design of the rating form, and conditions under
which the ratings are obtained. The following principles summarize the most
important considerations in these areas. Since the descriptive graphic rating
scale is the most generally useful form for school purposes, the principles are
directed specifically toward the construction and use of this type of rating
scale.

1. Characteristics should be educationally significant. Rating scales, like
other evaluation instruments, must be in harmony with the objectives and
desired learning outcomes of the school. Thus, when constructing or selecting
a rating scale the best guide for determining what characteristics are most
significant is our list of specific learning outcomes. Where these have been clearly
stated in behavioral terms, it is often simply a matter of selecting those that
can be most effectively evaluated by ratings and then modifying the statements
to fit the rating format.

2. Characteristics should be directly observable. There are two aspects to
this. First, the characteristics should be limited to those that occur in school
situations so that the teacher has an opportunity to observe them. Second, they
should be characteristics that are clearly visible to an observer. Overt behaviors
like participation in classroom discussion, clear enunciation, and skill in social
relations can be readily observed and reliably rated. However, less tangible
types of behavior, such as interest in the opposite sex, feeling of [illegible],
and attitude toward minority groups, tend to be unreliably rated because their
presence must be inferred from outward signs which are indefinite, [illegible],
and easily faked. Whenever possible, we should confine our ratings to those
characteristics which can be observed and judged directly.

3. Characteristics and points on the scale should be clearly defined. Many of
the errors in rating arise from the use of general, vague trait characterizations
and inadequate identification of the scale points. The brief descriptions used
with the descriptive graphic rating scale help overcome this weakness. They
not only clarify the meaning of the points on the scale but they also contribute
to a fuller understanding of each characteristic being rated. Where it is
impossible or inconvenient to use a descriptive scale, as on the back of a school report
card, a separate sheet of instructions can be used to provide the desired
behavioral descriptions.

4. Between three and seven rating positions should be provided and raters should
be permitted to mark at intermediate points. The exact number of points on a
particular scale is determined largely by the judgments to be made. In areas
permitting only crude judgments, fewer scale positions are needed. There is
usually no advantage in going beyond the 7-point scale, however. Only rarely
can we make finer discriminations than this, and we provide for those few
situations by allowing the rater to mark between points if he so desires.

5. Raters should be permitted to omit ratings where they feel unable to judge.
Some rating forms provide a place to check “unable to judge” or “insufficient
opportunity to observe” for each characteristic. Others provide a space for
comments after each characteristic, where it is possible either to justify the
rating given or to note the reason for not making a rating.

6. Ratings from several observers should be combined, wherever possible.
The pooled ratings of several teachers will generally provide a more reliable
description of pupil behavior than that obtained from any one teacher. In
averaging ratings, the personal biases of individual raters tend to cancel each other
out. Combined ratings are especially applicable at the high school level, where
specific teacher-pupil contact is limited but each pupil has classes with a number
of teachers. They are less feasible at the elementary school level, since we
are apt to have only the ratings of the pupil’s one regular teacher. The lack of
additional raters at this level, however, is at least partially offset by the greater
opportunity for the teacher to observe his pupils in a variety of situations.
Furthermore, the smaller number of elementary teachers in a school makes it
easier to detect and allow for common biases, such as the tendency to overrate
or underrate pupils.
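
The pooling of ratings from several teachers amounts to a simple average for each pupil. The sketch below is a minimal illustration under stated assumptions: the function name, the data layout, and the use of `None` to mark a rating omitted as "unable to judge" are choices made for the example and are not prescribed by the text.

```python
from statistics import mean


def pool_ratings(ratings_by_teacher):
    """Average each pupil's ratings over the teachers who rated him.

    `ratings_by_teacher` maps teacher -> {pupil: rating}. A rating of
    None (an assumed convention) marks "unable to judge" and is skipped.
    """
    collected = {}
    for ratings in ratings_by_teacher.values():
        for pupil, rating in ratings.items():
            if rating is None:
                continue  # omitted rating: do not count it as a low score
            collected.setdefault(pupil, []).append(rating)
    return {pupil: mean(values) for pupil, values in collected.items()}


# Hypothetical ratings of "responsibility" on a 5-point scale.
pooled = pool_ratings({
    "Teacher 1": {"Ann": 5, "Bob": 3},
    "Teacher 2": {"Ann": 3, "Bob": None},
    "Teacher 3": {"Ann": 4, "Bob": 2},
})
print(pooled)
```

Skipping omitted ratings, rather than treating them as zero, keeps a teacher's lack of opportunity to observe from lowering a pupil's pooled rating.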


CHECKLISTS 


A checklist is similar in appearance and use to the rating scale. The basic 
difference between them is in the type of judgment called for. A rating scale 
provides an opportunity to indicate the degree to which a characteristic 
is present or the frequency with which a behavior occurs. The checklist, on the 
other hand, calls for a simple “yes-no” judgment. It is basically a method of 
recording whether a characteristic is present or absent, or whether an action 
was taken or not taken. Obviously, a checklist should not be used where degree 
or frequency of occurrence are important aspects of the appraisal. 

Checklists are especially useful in evaluating those performance skills that 
can be divided into a series of clearly defined, specific actions. A typical exam- 
ple of such a checklist is shown in Figure 15.5. This instrument makes it pos- 
sible to record the actions of a pupil as he attempts to locate an object under 
the microscope. On the first part of the form, the teacher is to indicate the 
pupil’s sequence of actions by numbering them in the order in which they occur.
In other places, he is to check phrases which characterize the pupil’s skill in 
using the microscope. This procedure requires the teacher to observe one pupil 
at a time and to record the actions as they occur. 

The form in Figure 15.5 illustrates the major points to consider in develop- 
ing a checklist for procedure evaluation. These may be summarized as follows: 


1. Identify and describe clearly each of the specific actions desired in the
performance.

2. Add to the list those actions which represent common errors, if they are
limited in number and can be clearly identified (for example, actions c and d
in Figure 15.5).

3. Arrange the desired actions and likely errors in the approximate order in
which they are expected to occur.

4. Provide a simple procedure for numbering the actions in sequence or for
checking each action as it occurs.
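
The four points above can be illustrated with a small record-keeping sketch. It is offered only as one possible arrangement: the class name, its methods, and the marking of error actions with a true/false flag are assumptions made for the example, and the three actions shown are borrowed from the microscope checklist for illustration.

```python
class ProcedureChecklist:
    """Checklist that numbers a pupil's actions in the order observed."""

    def __init__(self, actions):
        # `actions`: (description, is_error) pairs in the expected order.
        self.actions = list(actions)
        self.sequence = {}  # description -> order in which it was observed

    def record(self, description):
        """Number an action the first time the pupil is seen doing it."""
        if description not in {d for d, _ in self.actions}:
            raise ValueError(f"not on the checklist: {description}")
        if description not in self.sequence:
            self.sequence[description] = len(self.sequence) + 1

    def report(self):
        """Listed actions with their sequence numbers (None if not seen)."""
        return [(d, self.sequence.get(d)) for d, _ in self.actions]

    def errors_observed(self):
        """Listed error actions that the pupil actually performed."""
        return [d for d, is_error in self.actions if is_error and d in self.sequence]


form = ProcedureChecklist([
    ("Takes slide", False),
    ("Wipes slide with lens paper", False),
    ("Wipes slide with finger", True),   # a common error listed on the form
    ("Places slide on stage", False),
])
for seen in ["Takes slide", "Wipes slide with finger", "Places slide on stage"]:
    form.record(seen)
print(form.report())
print(form.errors_observed())
```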


In addition to its use in procedure evaluation, the checklist can also be
used in evaluating products. For this purpose, the form usually consists of a
list of characteristics which the finished product should possess. In evaluating
the product, the teacher simply checks whether each characteristic is present
or absent. Before using a checklist for product evaluation, it should be decided
that the quality of the product can be adequately described by merely noting
the presence or absence of certain characteristics. If quality is more precisely
indicated by noting the degree to which each characteristic is present, a rating
scale should be used instead of a checklist.

In the area of personal-social development, the checklist can serve as a
convenient method of recording evidence of growth toward specific learning
outcomes. Typically, the form lists the behaviors representative of the outcomes
to be evaluated. A checklist for evaluating pupils’ “concern for others” is shown
in Figure 15.6. The

STUDENT'S ACTIONS                                        Sequence of Actions

a.  Takes slide
b.  Wipes slide with lens paper
c.  Wipes slide with cloth
d.  Wipes slide with finger
e.  Moves bottle of culture along the table
f.  Places drop or two of culture on slide
g.  Adds more culture
h.  Adds few drops of water
i.  Hunts for cover glasses
j.  Wipes cover glass with lens paper
k.  Wipes cover glass with cloth
l.  Wipes cover with finger
m.  Adjusts cover with finger
n.  Wipes off surplus fluid
o.  Places slide on stage
p.  Looks thru eyepiece with right eye
q.  Looks thru eyepiece with left eye
r.  Turns to objective of lowest power
s.  Turns to low-power objective
t.  Turns to high-power objective
u.  Holds one eye closed
v.  Looks for light
w.  Adjusts concave mirror
x.  Adjusts plane mirror
y.  Adjusts diaphragm
z.  Does not touch diaphragm
aa. With eye at eyepiece turns down coarse adjustment
ab. Breaks cover glass
ac. Breaks slide
ad. With eye away from eyepiece turns down coarse adjustment
ae. Turns up coarse adjustment a great distance
af. With eye at eyepiece turns down fine adjustment a great distance
ag. With eye away from eyepiece turns down fine adjustment a great distance

STUDENT'S ACTIONS (Continued)                            Sequence of Actions

ah. Turns up fine adjustment screw a great distance
ai. Turns fine adjustment screw a few turns
aj. Removes slide from stage
ak. Wipes objective with lens paper
al. Wipes objective with cloth
am. Wipes objective with finger
an. Wipes eyepiece with lens paper
ao. Wipes eyepiece with cloth
ap. Wipes eyepiece with finger
aq. Makes another mount
ar. Takes another microscope
as. Finds object
at. Pauses for an interval
au. Asks, "What do you want me to do?"
av. Asks whether to use high power
aw. Says, "I'm satisfied"
ax. Says that the mount is all right for his eye
ay. Says he cannot do it
az. Told to start a new mount
    Directed to find object under low power
    Directed to find object under high power

SKILLS IN WHICH STUDENT NEEDS FURTHER TRAINING

    In cleaning objective
    In cleaning eyepiece
    In focusing low power
    In focusing high power
    In adjusting mirror
    In using diaphragm
    In keeping both eyes open
    In protecting slide and objective from breaking by careless focusing

NOTICEABLE CHARACTERISTICS OF STUDENT'S BEHAVIOR

    Awkward in movements
    Obviously dexterous in movements
    Slow and deliberate
    Very rapid
    Fingers tremble
    Obviously perturbed
    Obviously angry
    Does not take work seriously
    Unable to work without specific directions
    Obviously satisfied with his unsuccessful efforts

CHARACTERIZATION OF THE STUDENT'S MOUNT

    Poor light
    Poor focus
    Excellent mount
    Good mount
    Fair mount
    Poor mount
    Very poor mount
    Nothing in view but a thread in his eyepiece
    Something on objective
    Smeared lens
    Unable to find object

Figure 15.5. Checklist for evaluating skill in the use of the microscope. (From Ralph W.
Tyler, “A Test of Skill in Using a Microscope,” Educational Research Bulletin, 9:49…,
1930, Bureau of Educational Research and Service, Ohio State University. Used by
permission.)


teacher writes a pupil’s name at the top of each column and periodically checks
those behaviors in which growth has been noted. A more detailed account can
also be kept on this form by dating each particular behavior observed in a
pupil.


Concern for Others

Check each child two or three times during the term to determine if growth
has taken place.

School ________    Date ________

Names of Children (one column for each child)

Behavior to be observed:

    Is sensitive to needs and problems of others
    [...] and solve problems
    [...] materials
    [...] and help
    [...] suggestions
    Sticks to group plans and decisions
    Works courteously and happily with others
    [...] to others
    Respects the property of others
    Enjoys group work
    Thanks others for help
    Commends others for contributions

Figure 15.6. Checklist for evaluating pupil’s “Concern for Others.” (From John U.
Michaelis, Social Studies for Children in a Democracy. Copyright 1963, by Prentice-Hall,
Inc., Englewood Cliffs, New Jersey. Used by permission.)


The checklist is probably least useful in summarizing a teacher’s general
impressions concerning the personality and adjustment of pupils. In evaluating
such characteristics as initiative, social maturity, and emotional stability, for
example, it is seldom sufficient to note merely whether the trait is present or
absent. Here, we are largely concerned with the degree to which a characteristic
is present or the frequency with which certain behaviors occur. Since these
finer discriminations are almost always possible, we should generally favor
the rating scale in this area. Only where our appraisal is so rough that we are
limited to a simple “present-absent” judgment should we resort to the use of
a checklist.


PUPIL PARTICIPATION IN RATING 


In this chapter, we have limited our discussion to observational methods used 
by the teacher. We purposely omitted those checklists and rating scales used 
as self-report techniques by pupils since these will be considered in the fol- 
lowing chapter. Before closing our discussion here, however, it should be pointed 
out that most of the devices used for recording the teacher’s observations can 
also be used by the pupil to judge his own progress. From an instructional 
standpoint, it is frequently useful to have a pupil rate himself (or his product) 
and then compare his rating with that of the teacher. If this comparison is made 
during an individual conference, the pupil and teacher can explore the reasons for 
each rating and discuss any marked discrepancies between the two sets of ratings. 
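
A teacher preparing for such a conference might first list the characteristics on which the two sets of ratings differ most. The sketch below shows one way this could be done; the function name, the sample characteristics, and the cutoff of two scale points for a "marked" discrepancy are assumptions made for illustration, not values given in the text.

```python
def marked_discrepancies(self_ratings, teacher_ratings, threshold=2):
    """Characteristics on which pupil and teacher ratings differ markedly.

    Returns {characteristic: self rating minus teacher rating} for every
    characteristic rated by both where the gap is at least `threshold`
    scale points (an assumed cutoff). A positive gap means the pupil
    rated himself higher than the teacher did.
    """
    shared = sorted(self_ratings.keys() & teacher_ratings.keys())
    gaps = {c: self_ratings[c] - teacher_ratings[c] for c in shared}
    return {c: gap for c, gap in gaps.items() if abs(gap) >= threshold}


# Hypothetical speech ratings on a 5-point scale.
pupil = {"opening remarks": 5, "gestures": 4, "voice": 2}
teacher = {"opening remarks": 3, "gestures": 4, "voice": 3}
print(marked_discrepancies(pupil, teacher))
```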

Self-rating by the pupil and a follow-up conference with the teacher have a
number of possible benefits. It should help the pupil to (1) understand better 
the objectives of the course, (2) recognize more clearly the progress he is 
making toward the objectives, (3) diagnose more effectively his own particular 
strengths and weaknesses, and (4) develop increased skill in self-evaluation. 
Of special value to the teacher is the additional insight gained. He has an oppor- 
tunity to see how each pupil views his own learning and development in relation 
to the goals of the course. 

Pupil participation need not be limited to the use of the evaluation instru- 
ments. It is frequently desirable also to have pupils take an active part in the 
development of the instruments. Through class discussion, for example, they 
can help identify the qualities desired in a “good speech” or a “well-written 
report,” or the particular behaviors that characterize “good citizenship.” A
combined list of these suggestions can then be used as a basis for constructing 
a rating scale or checklist. Involving pupils in the development of evaluation 
devices has special instructional values. First, it directs learning by causing
the pupils to think more carefully about the qualities to strive for in a per- 
formance or product. Second, it has a motivating effect, since pupils tend to 
put forth most effort when working toward goals they have helped to define. 


SUMMARY 


Observational techniques are especially useful in evaluating performance skills 
and certain aspects of personal-social development. In addition, the results of 
observation supplement and complement paper-and-pencil testing by indicating 
how pupils typically behave in natural situations. 

The least structured of the observational techniques is the anecdotal record. 
This is simply a method of recording factual descriptions of pupil behavior.
To make anecdotal record keeping feasible, it is usually necessary to restrict
observations at any given time to a few types of behavior or to a few pupils. 
Anecdotal records possess the advantages of (1) providing a description of 
behavior in natural settings, (2) obtaining evidence of exceptional behavior 
which is apt to be overlooked by other techniques, and (3) being usable with 
the very young and the retarded. Their limitations are (1) the time and effort 
required to maintain an adequate record system, (2) the difficulty of writing 
objective descriptions of behavior, and (3) the problem of obtaining an ade- 
quate sample of behavior. These limitations can be minimized by following 
specific procedures for observing and recording the behavioral incidents. 
Suggestions for improving anecdotal records include: (1) determining in ad- 
vance what to observe, (2) describing the setting in which the behavior occurred, 
(3) making the record as soon as possible, (4) limiting each anecdote to a single 
incident, (5) separating factual description from interpretation, (6) recording 
both positive and negative incidents, (7) collecting a number of anecdotes 
before drawing inferences, and (8) obtaining practice in observing and record- 
ing pupil behavior. 

Rating methods provide a systematic procedure for obtaining and recording 
the judgments of observers. Of the several types of rating scales available, 
the descriptive graphic scale seems to be the most satisfactory for school use.
For some purposes, ranking methods also are useful. In the rating of procedures, 
products, and various aspects of personal-social development certain types of 
errors commonly occur. These include: (1) personal bias, (2) halo effect, 
and (3) logical errors. The control of such errors is a major consideration in 
constructing and using rating scales. Effective ratings result when we (1) 
select characteristics which are educationally significant, (2) limit ratings to 
directly observable behavior, (3) define clearly the characteristics and the 
points on the scale, (4) limit the number of points on the scale, (5) permit 
raters to omit ratings where they feel unable to judge, and (6) combine ratings 
from several raters, wherever possible. 

Checklists perform somewhat the same functions as rating scales. They are 
used in evaluating procedures, products, and aspects of personal-social develop- 
ment where an evaluation of the characteristics is limited to a simple “present- 
absent” judgment. 

Involving pupils in the construction and use of rating devices has special 
values from the standpoint of learning and aids in the development of self- 
evaluation skills. 


SUGGESTIONS FOR FURTHER READING 


Ahmann, J. S., M. D. Glock, and H. L. Wardeberg. Evaluating Elementary School Pupils.
Boston: Allyn and Bacon, 1960. Chapters 10-13. Describes and illustrates the use of 
observational techniques at the elementary school level. 

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper & Row, 1960. Chap- 
ter 17: “Judgments and Systematic Observations.” 

Feldt, L. S. “The Reliability of Measures of Handwriting Quality,” Journal of Educational 
Psychology, 53, 288-292, 1962. 




Schwartz, A., and S. C. Tiedeman. Evaluating Student Progress in the Secondary School. 
New York: Longmans, Green, 1957. Chapter 9: Good discussion of checklists and rating 
methods. 

Thomas, R. M. Judging Student Progress. New York: Longmans, Green, 1960. Chapter 8: 
“Observing Students.” Chapter 11: “Rating, Checking Student Skills and Products.” 
Chapter 16: “Developing Students’ Evaluation Skills.” 

Thorndike, R. L., and Elizabeth Hagen. Measurement and Evaluation in Psychology and 
Education. New York: John Wiley & Sons, 1961. Chapter 13: “The Individual as Others 
See Him.” 

Wrightstone, J. W., J. Justman, and I. Robbins. Evaluation in Modern Education. New York: 
American Book Company, 1956. Chapter 7: “Observation and Anecdotal Records.” Chap- 
ter 9: “Checklists and Rating Scales.” 


Chapter 16 
evaluating learning 
and development: 
peer appraisal 

and self-report 


Judgments and reports of pupils provide valuable information in many
areas of learning and development: (1) Peer judgments . . . determined
by sociometric procedures . . . are especially useful in evaluating
personal-social development. . . . (2) Self-report methods provide a fuller
understanding of pupils’ needs, problems, adjustments, interests, and attitudes . . .
aid in assessing learning readiness . . . in curriculum planning . . . in pupil
guidance.


A teacher’s observations and judgments of pupil behavior are of special value
in those areas where the behavior is readily observable and the teacher’s training
and experience give him special competence to judge. In evaluating the
ability to operate a microscope or the quality of handwriting, for example, the 
teacher is unquestionably in the best position to make the judgment. He can 
directly observe the procedure, or the product resulting from the procedure, 
and his knowledge in the area contributes to the validity and reliability of the 
judgments. There are some areas of pupil development, however, where the 
teacher’s evaluation of behavior is apt to be inadequate unless his observations 
are supplemented and complemented by the judgments and reports of pupils. 

Various aspects of personal-social development can be more effectively 
evaluated by including peer ratings and other peer-appraisal methods in the 
evaluation program. In the realms of leadership ability, concern for others, 
effectiveness in group work, and social acceptability, for example, pupils
frequently know better than the teacher each other’s strengths and weaknesses.
The intimate interactions that occur in the give and take of peer relations are 
seldom fully visible to an outside observer. Some differences between teacher 
judgment and peer judgment can also be expected to occur because each is 
using different standards. Children’s criteria of social acceptability, for example, 
are apt to be quite different from the criteria used by adults. 

Self-report techniques are also a valuable adjunct to the teacher’s observa- 
tions of behavior. A complete picture of a pupil’s adjustments, interests, and 
attitudes cannot be obtained without a report from the pupil. His own expressed 
feelings and beliefs in these areas are at least as important as evidence ob- 
tained from observing his actual behavior. Although expressed feelings and 
observable behavior are not always in complete harmony, the self-report pro- 
vides valuable evidence concerning the pupil's perception of himself and how 
he wants others to view him. In fact, a discrepancy between reported feelings 
and actual behavior is, in itself, significant evaluative information. 

Though peer appraisal and self-report techniques are useful for understand- 
ing pupils better and for guiding their learning, development, and adjustment, 
the results should not be used for marking and reporting, or in any manner 
that interferes with honest responses. The pupils must be convinced that it is 
in their own best interests to respond as accurately and frankly as possible. A 
teacher who has good relations with his pupils and who has consistently
emphasized the positive values of the evaluation information should have no
difficulty in obtaining the pupils’ cooperation in the effective use of these
techniques.


PEER APPRAISAL 


In some instances it is possible to have pupils rate their peers (fellow pupils) 
on the same rating device used by the teacher. At the conclusion of a pupil’s 
oral report before the class, for example, the other pupils could rate his per- 
formance on a standard rating form. The average of these ratings would pro- 
vide a good indication of how the group felt about the pupil’s performance. 
Except for oral reports, speeches, demonstrations, and similar situations where 
one individual performs at a time, however, the usual rating procedures are 
seldom feasible with pupils. If we ask pupils to rate their classmates on a series 
of personal-social characteristics, each pupil is required to fill out thirty or 
more rating forms. This becomes so cumbersome and time consuming that we 
could hardly expect the ratings to be diligently made. When peer ratings and
other methods of peer appraisal are used, we must depend on greatly simplified 
procedures. Some of the techniques are so simple that they can be used effec- 
tively with pupils at the primary school level. 

The most widely used techniques in this area include the following: (1)
“guess who” technique, (2) sociometric technique, and (3) social relations
scales. Each of these will be described in turn.


Peer Appraisal and Self-Report 333 
“Guess Who” Technique 


One of the simplest methods of obtaining peer judgments is by means of the 
“guess who” technique. With this procedure, each pupil is presented with a 
series of brief behavior descriptions and asked to name those pupils who best 
fit each description. The descriptions may be limited to positive characteristics 
or they may also include negative behaviors. The following items, taken from 
a form for evaluating concern for others, are typical of the types of positive 
and negative descriptions used. 


1. Here is someone who is willing to share ideas and materials with others. 
2. Here is someone who does not care to share ideas and materials with
others.


Some teachers prefer to use only the positive behavior descriptions because 
of the possible harmful effects of negative nominations on group morale. Each 
individual teacher must make this decision for himself, however, since he is 
the only one in a position to determine what the effects might be on his pupils. 
Where good relations have been established among pupils and between teacher 
and pupils, this is not likely to be a problem. However, if doubt exists, it is 
usually better to sacrifice part of the evaluative data than to disrupt the morale 
of the class.

In naming persons for each behavior description, the pupils are usually per- 
mitted to name as few or as many as they wish. Typical directions and sample 
items from a form for evaluating various personal-social characteristics are 
shown in Figure 16.1. The directions and behavior descriptions must, of course, 
be adapted to the age level of the pupils. With very young pupils, the technique 
can be presented as a guessing game with items stated as follows: “Here is
someone who talks a lot—guess who?” When the technique is used with older
pupils the “guess who” aspect is dropped and the pupils are merely told to write
the names of those who best fit each behavior description.

The “guess who” technique is based on the nomination method of obtaining
peer ratings and is scored by simply counting the number of mentions each
pupil receives on each description. If both positive and negative descriptions
are used, such as friendly and unfriendly, the number of negative mentions on
each characteristic is subtracted from the number of positive mentions. For
example, 12 mentions as being friendly and 2 mentions as being unfriendly
would result in a score of 10 on friendliness. The pattern of scores for each
pupil indicates the reputation he holds among his peers. This may not completely
agree with the teacher’s impressions of the pupil but it is nonetheless
significant information concerning personal-social development. In fact, one of
the great values of this type of peer appraisal is that it makes the teacher aware 
of feelings and attitudes among pupils which he had been unable to detect 
through direct observation. 
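
The scoring rule just described (positive mentions minus negative mentions on each characteristic) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the data layout, and the pupils' names are assumptions made for the example and are not part of any published form.

```python
from collections import Counter

def score_guess_who(nominations, pairs):
    """Score a "guess who" form.

    nominations: {description: [pupil named, one entry per mention]}
    pairs: {characteristic: (positive description, negative description)}
    Both layouts are hypothetical, chosen only for this sketch.
    """
    scores = {}
    for trait, (positive, negative) in pairs.items():
        pos_counts = Counter(nominations.get(positive, []))
        neg_counts = Counter(nominations.get(negative, []))
        # Score = positive mentions minus negative mentions.
        for pupil in set(pos_counts) | set(neg_counts):
            scores.setdefault(pupil, {})[trait] = pos_counts[pupil] - neg_counts[pupil]
    return scores

# Hypothetical class data: Ann is named friendly 12 times, unfriendly twice.
nominations = {
    "friendly": ["Ann"] * 12 + ["Bob"] * 3,
    "unfriendly": ["Ann"] * 2 + ["Carl"] * 4,
}
scores = score_guess_who(nominations, {"friendliness": ("friendly", "unfriendly")})
print(scores["Ann"]["friendliness"])
```

A pupil who is never named on either description does not appear in the result at all, which mirrors the limitation noted later for shy, withdrawn pupils.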

This nominating method can be used to evaluate any aspect of personal-social 




SOCIAL ANALYSIS OF THE CLASSROOM
Directions

Below are some word pictures of members of your class. Read each statement
and write down the names of the persons whom you think the descriptions fit.

REMEMBER: One description may fit several persons. You may write as many
names as you think belong under each.

The same person may be nominated for more than one description.

Write "myself" if you think the description fits you.

If you cannot think of anyone to match a particular description, go on to
the next one.

You will have as much time as you need to finish. Do not hurry.

NOW YOU ARE READY TO BEGIN.

3. Here is someone who likes to talk a lot, always has something to say.

4. Here is someone who doesn't like to talk much, is very quiet, even when
nearly everyone else is talking.

This is someone who is always cheerful, jolly, and good-natured, who
laughs and smiles a good deal.

Here is someone who always seems rather sad, worried, or unhappy, who
hardly ever laughs or smiles.

Here is someone who is very friendly, who has lots of friends, who is
nice to everybody.

Here is someone who does not care to make friends or who is bashful
about being friendly, or who does not seem to have many friends.


Figure 16.1. Sample items from a “Guess Who” form used to evaluate various personal-
social characteristics. (From Ruth Cunningham, Understanding Group Behavior of Boys
and Girls, Copyright 1951 by Bureau of Publications, Teachers College, Columbia University.
Used by permission.)




development for which pupils have had an adequate opportunity to make 
observations. It is especially valuable for appraising personality characteristics, 
character traits, and social skills, but it is not limited to these areas. Figure 
16.2 contains a list of “guess who” statements which were used to evaluate five 
different dimensions of creative thinking.1 The dimension each item attempted
to measure is indicated in parentheses following the question. As with other 
evaluation techniques, the specific items used in any particular “guess who” 
form should be derived directly from the objectives to be evaluated. 


Who in your class comes up with the most ideas? (Fluency)

Who has the most original or unusual ideas? (Originality)

If the situation changed or if a solution to a problem wouldn't work,
who in your class would be the first to find a new way of meeting
the problem? (Flexibility)

Who in your class does the most inventing and developing of new ideas,
gadgets, and such? (Inventiveness)

Who in your class is best at thinking of all the details involved in
working out a new idea and thinking of all the consequences?
(Elaboration)


Figure 16.2. Sample “Guess Who” items for evaluating aspects of creative thinking ability.
(From E. Paul Torrance, Guiding Creative Talent. Copyright 1962 by Prentice-Hall, Inc.
Used by permission.)


The main advantage of the “guess who” technique is its usability. It can be 
administered in a relatively few minutes, to pupils of all age levels, and scoring 
is a simple matter of counting the number of nominations received. Its main
limitation is the lack of information it provides on the shy, withdrawn pupil. 
Such pupils are frequently overlooked when nomination methods are used. In 
effect, they have no reputation in the peer group and are simply ignored during 
the rating process. 


Sociometric Techniques 


The sociometric technique is a method for evaluating the social acceptance 
of individual pupils and the social structure of a group. It is a relatively simple 
technique, based on pupils’ choices of companions for some group situation or 
activity. A typical sociometric form is shown in Figure 16.3. This form was
used to measure pupils’ acceptance as seating companions, work companions, 


1 E. P. Torrance, Guiding Creative Talent (Englewood Cliffs, New Jersey: Prentice-Hall,




and play companions at the later elementary school level. The directions illus- 
trate several important principles of sociometric choosing. (1) The choices 
should be real choices which are natural parts of the ongoing activities in the 
classroom. (2) The basis for choice and the restrictions on the choosing should 
be made clear. (3) All pupils should be equally free to participate in the activity 
or situation. (4) The choices each pupil makes should be kept strictly confiden- 
tial. (5) The choices should be actually used to organize or rearrange groups. 
More spontaneous and truthful responses can be expected where the pupils 
know that their choices will be put into effect. 

School activities abound with possibilities for sociometric choosing. Pupils 
can choose laboratory partners, fellow committee members, companions for 
group projects, and the like. Although some differences in choice can be 
expected from one choosing situation to another, a large element of social 
acceptance runs through all choices. A pupil who is highly chosen for one 
activity will also tend to be highly chosen for other activities. Greatest varia- 
tion in the choosing occurs where very specific activities are used and where 
skill and knowledge play a prominent role in successful performance. Even a 
relatively unpopular pupil might be highly chosen as a team mate for baseball 
if he is an exceptionally good player. It is unlikely that he would be chosen as 
seating companion in the classroom, however, since this is almost a pure meas- 
ure of social acceptance. 

There is some disagreement among sociometric experts concerning the 
desirability of asking pupils to name also those whom they would not want as 
a companion. The arguments in favor of such negative choices are that rejected 
pupils can be identified and helped, and that interpersonal friction can be 
avoided in arranging groups. The counterargument is that such questions make 
pupils more conscious of their feelings of rejection and that this may disturb 
both group morale and the emotional development of pupils. The safest pro- 
cedure seems to be to avoid the use of negative choices unless they are abso- 
lutely essential to the purpose for which the technique is being used. Where their 
use is essential, the approach should be casual and the pupils permitted, rather 
than required, to make such choices. A statement like the following will ordi- 
narily suffice: “If there are pupils you would rather not have in your group, 
you may also list their names.” 

It is usually desirable to restrict the number of choices each pupil makes on 
a sociometric question. For most purposes, five choices for each activity is a 
suitable number. Sociometric results have been shown to increase in reliability 
up to five choices, with no increase beyond that number.2 Also, five choices
makes it easier to arrange sociometric groups since it is sometimes difficult to 
satisfy the first several choices for all pupils. At the lower-elementary grades, it 
is usually necessary to limit the choices to two or three. Very young children 
find it difficult to discriminate beyond this number. 

Tabulating Sociometric Results. The pupils’ sociometric choices must be 


2 N. E. Gronlund, Sociometry in the Classroom (New York: Harper & Row, 1959). 




Name ____________________          Date ____________


During the next few weeks we will be changing our seats around, working in small groups
and playing some group games. Now that we all know each other by name, you can help me
arrange groups that work and play best together. You can do this by writing the names of the
children you would like to have sit near you, to have work with you, and to have play with
you. You may choose anyone in this room you wish, including those pupils who are absent.
Your choices will not be seen by anyone else. Give first name and initial of last name.

Make your choices carefully so the groups will be the way you really want them. I will try
to arrange the groups so that each pupil gets at least two of his choices. Sometimes it is hard
to give everyone his first few choices so be sure to make five choices for each question.


Remember!
Your choices must be from pupils in this room, including those who are absent.
You should give the first name and the initial of the last name.
You should make all five choices for each question.
You may choose a pupil for more than one group if you wish.
Your choices will not be seen by anyone else.


I would choose to sit near these children:

1. __________  2. __________  3. __________  4. __________  5. __________


I would choose to work with these children:

1. __________  2. __________  3. __________  4. __________  5. __________


I would choose to play with these children:

1. __________  2. __________  3. __________  4. __________  5. __________


Figure 16.3. Illustrative sociometric form. (From Sociometry in the Classroom. Copyright
1959 by Norman E. Gronlund.)


organized in some fashion, if we are to interpret and use them properly. A
simple tally of the number of choices each pupil receives will indicate degree
of social acceptance, but it will not provide information concerning who made
the choices, whether two pupils chose each other, and what the social structure
of the group is like. A complete record of the sociometric results can be obtained
by tabulating the choices in a matrix table like the one shown in Figure 16.4.
Note that the pupils’ names are listed down the side of the table and are numbered.
These same numbers, corresponding to the pupils’ names, appear
across the top of the table so that each pupil’s choices can be
recorded in the appropriate column. For example, the choices of John A. were
as follows: 


Chose               Rejected
1. Bill H.          X Henry D.
2. George L.        X Bob F.
3. Mike A.
4. Betty A.
5. Pete V.


These choices were recorded in the table to the right of John A.’s name by 
placing number 1 in column six to indicate Bill H. as his first choice, number 2 
in column seven to indicate George L. as his second choice, and so on. The X's 
represent rejection choices and the circled numbers in the table indicate mutual 
choices. Note, for example, that Mike A. (number 2) chose Jim B. (number 3), 
and vice versa. Mutual choices are always an equal number of cells from the 
diagonal line, in each corresponding column and row. 
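
The tabulation just described can be sketched as follows. This is a minimal illustration under stated assumptions: pupils are identified by their row numbers, the four-pupil data set is invented for the example, and rejection choices (the X's) are left out to keep the sketch short.

```python
def build_matrix(choices, n):
    """Build an (n+1) x (n+1) table; row = chooser, column = pupil chosen.

    choices: {chooser number: [chosen numbers, in order of preference]}
    A cell holds the level of choice (1 = first choice); 0 means no choice.
    Index 0 is unused so that cells line up with pupil numbers.
    """
    size = n + 1
    matrix = [[0] * size for _ in range(size)]
    for chooser, chosen in choices.items():
        for level, other in enumerate(chosen, start=1):
            if other != chooser:  # diagonal cells stay empty
                matrix[chooser][other] = level
    return matrix

def mutual_choices(matrix):
    """Pairs (a, b) whose cells mirror each other across the diagonal."""
    n = len(matrix) - 1
    return [(a, b) for a in range(1, n + 1) for b in range(a + 1, n + 1)
            if matrix[a][b] and matrix[b][a]]

def choices_received(matrix):
    """Column totals, counting one for each choice regardless of level."""
    n = len(matrix) - 1
    return {p: sum(1 for row in range(1, n + 1) if matrix[row][p])
            for p in range(1, n + 1)}

# Hypothetical four-pupil class.
choices = {1: [2, 3], 2: [3, 1], 3: [2], 4: [1, 2]}
matrix = build_matrix(choices, 4)
pairs = mutual_choices(matrix)
received = choices_received(matrix)
print(pairs, received)
```

The mutual-choice test simply checks the two cells that sit an equal distance from the diagonal, which is the property noted in the text.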

In this particular tabulation form, the boys and girls are listed separately. 
This divides the main part of the table into four quarters. The boys’ choices of 
boys fall in the upper left-hand quarter of the table and the girls’ choices of 
girls fall in the lower right-hand quarter. The diagonal line, which goes through 
the empty cells that are unused because pupils do not choose themselves, cuts 
through these two quarters. The upper right-hand quarter and the lower left- 
hand quarter, then, contain only cross-sex choices. This division of the table 
makes the number of choices given to the same sex and to the opposite sex 
readily apparent and easy to summarize. 

In totaling the number of choices received, in Figure 16.4, each choice was 
given a value of one regardless of level of choice. Some teachers prefer to 
weight the choices so that a first choice counts more than a second choice, and 
so on, but there is no rational basis for assigning such weights. Various arbi- 
trary weighting systems have been tried but none has been shown to be superior 
to the method used here.3 While it seems sensible to expect a pupil’s first choice
to have greater significance than his second, the degree to which choices differ 
cannot be predicted. One pupil may have a strong first preference, while another 
is equally attracted to several friends and finds it difficult to discriminate among 
his first several choices. Until a weighting system is found that handles such 
discrepancies, the simpler method of counting one for each choice should be
used. The level of choice should still be recorded in the matrix table, however, 
since it is useful when the choices are used to organize groups. 


The number of choices a pupil receives on a sociometric question is used as 
an indication of his social acceptance by peers. Where five choices are used, as 
in Figure 16.4, pupils who receive nine or more choices are called stars. Those 
who receive no choices are called isolates, and those who receive one choice
are called neglectees. The remaining pupils, who fall somewhere above or 
below average, are given no special name. If a pupil receives only rejection 


3 Ibid.


[Figure 16.4: matrix table not reproduced. School: Central; Teacher: F. R. Young. Rows list the pupils' names; columns show the pupils chosen; summary rows give choices received and rejections. NOTE: SS = Same Sex; OS = Opposite Sex.]

Figure 16.4. Matrix table showing choices and rejections of work companions. (From Sociometry in the Classroom. Copyright 1959 by Norman E. Gronlund.)


choices, he is called a rejectee. As noted earlier, where pupils choose each other 
they are called mutual choices. This terminology is standard in describing and 
interpreting sociometric results. 
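
The terminology above amounts to a simple classification rule, sketched here for illustration. The cutoff of nine choices applies only where five choices are allowed, as in Figure 16.4; the function name and the treatment of a rejectee as a pupil with rejections but no positive choices are assumptions made for this sketch.

```python
def classify(choices_received, rejections_received=0, star_cutoff=9):
    """Return the sociometric category for one pupil, or None if unnamed.

    star_cutoff=9 assumes five choices per pupil, as in the text.
    """
    if choices_received >= star_cutoff:
        return "star"
    if choices_received == 0:
        # Only rejection choices received: rejectee; nothing at all: isolate.
        return "rejectee" if rejections_received > 0 else "isolate"
    if choices_received == 1:
        return "neglectee"
    return None  # somewhere above or below average, no special name

for received, rejected in [(11, 0), (0, 0), (0, 4), (1, 0), (5, 1)]:
    print(received, rejected, classify(received, rejected))
```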

The Sociogram. The matrix table is useful for organizing sociometric data 
for future use and for determining the social acceptance of individual pupils. 
It does not provide a clear picture of the social structure of the group, how- 
ever. Where this is desired, the sociometric results are presented in the form of 
a sociogram. This is a graphic picture of the social relations existing in a 
group and it may be plotted directly from the data recorded in the matrix table. 
A typical sociogram is shown in Figure 16.5. The sociometric data depicted 
here were taken from Figure 16.4. 

Note that the concentric circles form a target-type diagram on which to plot 
the sociometric data. Pupils in the star category (9 or more choices) are placed 
in the center of the target; isolates are placed in the outer ring; and the remain- 
ing pupils are placed between these extremes in terms of the number of choices 
received. The boys are represented by triangles and the girls by circles, with 
the numbers corresponding to each pupil's number in the matrix table (see
Figure 16.4). The uncluttered appearance of this sociogram is due to the fact
that the use of lines is confined to mutual choices and rejections. Plotting all 
choices would result in such a maze of lines that the sociogram would be impos- 
sible to interpret. 

In constructing a sociogram, it is helpful to start with the most highly chosen 
pupils and work out from the center of the diagram. Pupils with mutual choices 
should be placed near each other. The original placement on the chart should 
be done lightly in pencil since considerable rearrangement is necessary during 
the plotting to minimize the number of crossed lines. Plotting boys and girls 
on separate sides of the diagram also simplifies the process since the number 
of mutual cross-sex choices is usually small. When all pupils have been finally 
arranged on the diagram, check to be certain that each pupil is still in the 
proper position between the concentric circles, since this indicates the approxi- 
mate number of choices he has received. 

The sociogram in Figure 16.5 illustrates the common social configurations 
you can expect in group structure. Girls number 12, 11, 14, 18, and 20 form a 
very cohesive clique. Girls number 13, 16, and 17 form a triangle. Boys number 
10, 8, and 6 form a chain of mutual choices and boys number 4 and 5 form a 
mutual pair. In addition, there seems to be a social cleavage between boys and 
girls, except for the few mutual cross-sex choices of boy number 6. Pupils 1, 
15, and 19 are isolated from the group and pupil number 19 is actively rejected 
by four of her classmates. 

While sociograms depict in graphic form the social relations present in the 
group, they do not indicate why a particular social structure evolved nor what 
should be done, if anything, to change it. The sociogram is merely a starting 
point. To understand the cliques, cleavages, and social positions of individual 
pupils, it is necessary to supplement sociometric data with information obtained 
from observation, “guess who” techniques, and various other evaluation methods.


[Figure 16.5: sociogram not reproduced. Legend: triangles represent boys and circles represent girls; lines mark mutual choices and rejections. School: Central; Class: 5A; Teacher: F. R. Young; Date: 4/25/57.]

Figure 16.5. Sociogram depicting choices and rejections of work companions, based on
data in Figure 16.4. (From Sociometry in the Classroom. Copyright 1959 by Norman E.
Gronlund.)


Uses of Sociometric Results. The sociometric technique has been used
for a variety of purposes in the school. Its major uses include: (1) organizing
classroom groups, (2) improving the social adjustment of individual pupils,
(3) improving the social structure of groups, and (4) evaluating the influence
of school practices on pupils’ social relations. Each of these uses will be briefly
discussed.

The first step in using sociometric results is to put the choices into effect. For
example, if committees are to be formed, they should be patterned as closely
as possible to the pupils’ choices. This can usually be done most effectively by
starting with the isolates and working toward the pupils receiving the largest
number of choices. With five choices, it is usually possible to satisfy at least
two choices for each pupil. By starting with the isolates, you are able to give 
them their two highest choices. This places them in contact with the pupils with 
whom they have the best chance for developing social relations. In arranging 
the groups, the sociometric choices should, of course, not be followed blindly. 
It may be desirable to place rural and urban pupils together to reduce an 
undesirable cleavage, or to separate the members of a clique which has been 
disrupting to the class. These adjustments can and should be made without 
violating your promise to give all of the pupils some of their choices. 
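
Once a tentative grouping has been drawn up, it is easy to check whether each pupil has been given at least two of his choices. The sketch below does only that check; it does not arrange the groups. The pupils' names, the data layout, and the function names are hypothetical, chosen for illustration.

```python
def satisfied_choices(groups, choices):
    """Count, for each grouped pupil, how many of his choices share his group.

    groups: list of lists of pupil names (one inner list per group)
    choices: {pupil: [pupils chosen]}
    """
    group_of = {pupil: i for i, members in enumerate(groups) for pupil in members}
    return {pupil: sum(1 for c in choices.get(pupil, []) if group_of.get(c) == g)
            for pupil, g in group_of.items()}

def below_minimum(groups, choices, minimum=2):
    """Pupils who received fewer than `minimum` of their choices."""
    counts = satisfied_choices(groups, choices)
    return sorted(p for p, k in counts.items() if k < minimum)

# Hypothetical six-pupil class with three choices each.
choices = {
    "Ann": ["Bea", "Cal", "Dot"], "Bea": ["Ann", "Cal", "Eve"],
    "Cal": ["Ann", "Bea", "Dot"], "Dot": ["Eve", "Fay", "Ann"],
    "Eve": ["Dot", "Fay", "Bea"], "Fay": ["Dot", "Eve", "Ann"],
}
plan_a = [["Ann", "Bea", "Cal"], ["Dot", "Eve", "Fay"]]
plan_b = [["Ann", "Bea", "Dot"], ["Cal", "Eve", "Fay"]]
print(below_minimum(plan_a, choices))  # pupils short-changed under plan A
print(below_minimum(plan_b, choices))  # pupils short-changed under plan B
```

A check of this kind is useful after deliberate adjustments, such as separating a disruptive clique, to confirm that the promise made to the pupils still holds.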

Although sociometric results do not indicate how to improve social adjust- 
ment, they do aid in identifying those pupils who are having difficulty in adjust- 
ing to the peer group. Isolated and rejected pupils are not apt to improve their 
social position without special help. If an isolated pupil is new to class, arrang- 
ing opportunities for social contact may be all that is needed. In other cases,
it may be a matter of helping an isolated pupil improve his appearance, social 
skills, and apparent value to the group. In some instances, the pupil may be 
so socially withdrawn or aggressive toward others that the assistance of the 
parents, the school counselor, and other special personnel may be required. 
Specific remedial procedures should be based on the causes of the pupil’s isola- 
tion or rejection. The sociometric results merely alert us to the pupils most in 
need of further study and possible remedial action. 

Sociometric measurement can contribute to the improvement of group struc- 
ture in two ways. First, it helps clarify the cliques, cleavages, and mutual rela- 
tions present in a group. Second, it provides the basic data for rearranging the 
group in a manner which is likely to result in a more cohesive social pattern. 
A disintegrated classroom structure, characterized by an overabundance of
cliques, cleavages, and isolated pupils, commonly results in low group morale 
and a poor climate for learning. It is not expected that a simple rearranging 
will eliminate cleavages along racial, religious, rural-urban, or socio-economic 
lines. However, it does provide opportunity for interpersonal contacts and, if 
accompanied by other efforts to improve relations, these cleavages can be 
appreciably diminished.4

In addition to its more common uses in the classroom, sociometric measure- 
ment is also useful in evaluating the effect of particular school practices on 
pupils’ social relations. For example, it can be used to help answer questions 
like the following: Is our method of ability grouping creating a social cleavage 
between gifted and regular pupils? Does the competition we have built up 
between boys and girls in our elementary school result in an extreme sex 
cleavage? How does our activity program, which prevents bus-transported pupils 
from participating fully, influence the social relations between town and rural 
youth? These and similar questions have been studied in schools with the aid 
of sociometric measures.5 The simplicity of the technique makes it usable with
an entire school population as well as with a single classroom group. 

4 N. E. Gronlund, Sociometry in the Classroom (New York: Harper & Row, 1959).


5 M. E. Bonney, “Sociometric Methods,” Encyclopedia of Educational Research, 3rd edi- 
tion (New York: Macmillan, 1960). 




Social Relations Scales 


Adaptations of the sociometric technique have appeared in various forms. 
One of the most recent is that of the Syracuse Scales of Social Relations. This is 
a series of published scales designed for use at three levels: elementary (grades 
1-6), junior high (grades 7-9), and senior high (grades 10-12). At each level 
there are two scales. These scales present pupils with social situations which 
serve as a basis for rating all of their classmates. In contrast to the traditional 
sociometric procedure, which calls for choices in terms of particular activities, 
peer ratings with these scales are based on the extent to which class members 
satisfy certain psychological needs. For example, on the first scale, used at all 
three levels, the pupil is asked to rate his classmates as kind, sympathetic friends 
with whom he would talk over his troubles when unhappy (need: succorance). 
The second scale, which varies from one level to another, requests ratings on 
the following types of situations: 


Elementary—Someone to help him do something real well so that people will praise him
(need: achievement-recognition).
Junior High—Someone he admires as an ideal (need: deference).
Senior High—Someone he would enjoy being with at a party (need: playmirth).


The procedure for responding to the situation on each scale has been made 
standard by means of carefully developed directions. In essence, these directions 
include three basic steps. First, each pupil is requested to select representative 
persons for five points on the scale, ranging from most liked for the situation 
to least liked. These five key persons may be anyone whom the pupil has ever 
known. Second, the pupil is asked to place each classmate on the five-point scale
by indicating which key person each classmate is most like. Third, he is to indi- 
cate whether the classmate being rated is better than, the same as, or less good 
than, the key person to whom he is being compared. 

A sample form used to explain the procedure to the pupils is shown in
Figure 16.6. The names in the boxes across the bottom of the form indicate the
persons John Smith has selected for the key points on this particular scale
(need: succorance). Note that he most prefers his mother for this situation,
least prefers Alice, and that Dan, Uncle John, and Neighbor Jones are placed
at intermediate points. These five persons, then, serve as John Smith’s reference
points for rating each of his classmates.

For scoring purposes, each scale position is assigned a numerical value. The 
median rating a pupil receives from his classmates on each scale indicates his 
reputation as a satisfier of that particular need, and the median rating he gives 
his classmates indicates how he perceives them. Norms are also provided with
the scales, making it possible to compare a pupil’s scores with a relatively large 
number of children at his own grade level. In addition, it is possible to compute 
class averages which serve as indices of group morale. 
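
The scoring just described can be illustrated with a short sketch. It is not the publisher's scoring procedure: the pupil names, the numerical values (1 for the least-liked position through 5 for the most-liked), and the function names are all hypothetical, and the class average is taken here simply as the mean of every rating given, which is only one plausible reading of "class averages."

```python
from statistics import mean, median

# ratings[rater][ratee] = numerical value of the scale position assigned.
# All names and values are hypothetical (1 = least liked, 5 = most liked).
ratings = {
    "Ann": {"Bob": 4, "Cal": 2, "Dee": 5},
    "Bob": {"Ann": 3, "Cal": 1, "Dee": 4},
    "Cal": {"Ann": 5, "Bob": 4, "Dee": 3},
    "Dee": {"Ann": 4, "Bob": 2, "Cal": 2},
}


def median_received(ratings, pupil):
    """Median rating a pupil receives from classmates (his reputation)."""
    return median(
        given[pupil]
        for rater, given in ratings.items()
        if rater != pupil and pupil in given
    )


def median_given(ratings, pupil):
    """Median rating a pupil gives his classmates (how he perceives them)."""
    return median(ratings[pupil].values())


def class_average(ratings):
    """Mean of all ratings given in the class (assumed index of group morale)."""
    return mean(value for given in ratings.values() for value in given.values())


for pupil in ratings:
    print(pupil, median_received(ratings, pupil), median_given(ratings, pupil))
print("class average:", class_average(ratings))
```

Comparing a pupil's medians with the published norms would be a separate step, since the norm tables themselves are not reproduced here.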

Although these scales lack the flexibility of the traditional sociometric tech-
nique and do not provide data for the sociometric rearrangement of groups,
they have the advantage of standard procedures for administration and scoring. 




This provides greater comparability of results from one grade level to the next 
and makes possible a cumulative record of social growth. As with all sociometric 
procedures, however, care must be taken in interpreting the results. Social 
relations are a function of both the individual and the group, so we can expect 
some variation in an individual’s scores from one group to another. The infor-
mation provided by these scales will be most valuable where it is verified and
supplemented by other types of data for evaluating the personal-social develop-
ment of pupils.


[Figure 16.6 appears here: a sample rating form on which the key persons are entered in boxes along a scale running from LEAST through MEDIUM to MOST, with other names marked at positions along the scale.]

Figure 16.6. Sample form. The Syracuse Scales of Social Relations: Senior High Level,
by Eric F. Gardner and George F. Thompson. (Copyright 1958 by Harcourt, Brace &
World, Inc., New York, N.Y. All rights reserved. Reproduced by permission.)


SELF-REPORT TECHNIQUES 


In general, there are two types of information that may be profitably obtained 
by self-report techniques. These are: (1) information concerning the pupil’s 
past behavior, such as the books he has read, the hobbies he has engaged in, 
and the experiences he has had in a particular area; (2) information concern- 
ing the pupil’s inner life, such as his worries and concerns, his feelings toward 
himself and others, his interests, and his opinions. Both types of information 
are typically inaccessible by other means—the first because it deals with past 
behavior no longer observable, and the second because it is concerned with 
behavior not readily discernible to an outside observer. 


Of the various methods of obtaining information directly from an individual, 




the oldest and best known is that of the personal interview. The face-to-face 
contact provided by the interview gives it several advantages as a self-report 
procedure. First, considerable flexibility is provided. The interviewer can 
clarify his questions if they are not readily understood, he can pursue promising 
lines of inquiry, and he can provide the interviewee an opportunity to qualify 
or expand on his answers, as needed. Second, the interviewer can carefully 
observe the interviewee during the session, noting the amount of feeling attached 
to his answers, the topics on which he seems to be evasive, and the areas in 
which he is most expansive. Third, the interview makes possible not only col- 
lecting information from an individual but also sharing information with him 
and, as in the case of the counseling interview, using the face-to-face contact 
as a basis for therapy. 

The personal interview would provide an almost ideal method of obtaining 
self-report information from pupils except for two serious limitations. It is 
extremely time consuming and the information provided is not standard from 
one person to another. In the interests of both feasibility and greater compar-
ability of results, the self-report inventory or questionnaire is commonly used
in place of the personal interview. An inventory consists of a standard set of 
questions pertaining to some particular area of behavior, administered and 
scored under standard conditions. It is a sort of standardized, written inter-
view which makes possible the collection of a large amount of information
quickly and an objective summary of the data collected. 

The effective use of self-report inventories assumes that the individual is both 
willing and able to report accurately. Responses can usually be easily faked 
if an individual desires to present a distorted picture of himself. Even where he 
wants to be truthful, there is the possibility that his recollection of past events
will be inaccurate and that his self-perceptions will be biased. These limitations
can be partly offset by using self-report inventories only in those areas where
pupils have little reason for faking, by emphasizing the value of frank responses
for self-understanding and self-improvement, and by taking into account the
possible presence of distortion when interpreting the results. As we shall see
shortly, the problem of obtaining accurate responses varies considerably from 
one type of self-report inventory to another. 


Activity Checklists 


Pupils have numerous incidental and informal learning experiences which 
have implications for classroom instruction. For example, they read books and
magazines, watch television, play games, have hobbies, belong to social and
special interest clubs, and engage in various scientific, literary, and artistic
activities on their own. A survey of such activities is frequently desirable in
assessing pupil readiness for new learning experiences and for general cur-
riculum planning. The activities a pupil has engaged in also provide clues to
his creative growth and his potential for development in various
vocational areas. In Project TALENT, one of the most comprehensive
studies of high school pupils’ aptitudes and abilities ever attempted,




information concerning personal experiences was considered so important that 
80 minutes of testing time was allotted to the task of obtaining it.⁶

Most activity checklists are constructed by the teacher, or researcher, to fit 
some specific purpose. A portion of a checklist used by Torrance⁷ to study the
development of creativity in children is presented in Figure 16.7. The complete 
checklist contains 100 items and includes activities related to language arts, 
science, social studies, art, and other fields. 


THINGS DONE ON YOUR OWN

Name ________  Grade ________  School ________  Date ________

DIRECTIONS: Below is a list of activities boys and girls sometimes do on their own. Indicate
which ones you have done during this school term by checking the blank at the left. Include
only the things you have done on your own, not the things you have been assigned or made to
do.

( )  1. Wrote a poem
( )  2. Wrote a story
( )  3. Wrote a play
( )  4. Kept a collection of my writings
( )  5. Wrote a song or jingle
( )  6. Produced a puppet show
( )  7. Kept a diary for at least a month
( )  8. Played word games with other boys and girls
( )  9. Used Roget's Thesaurus or some other book in addition to a dictionary
( ) 10. Recorded on a tape recorder an oral reading, dialogue, story, discussion, or the like
( ) 11. Found errors in fact or grammar in newspaper or other printed matter
( ) 12. Acted in play or skit
( ) 13. Directed or organized a play or skit
( ) 14. Made up and sang a song
( ) 15. Made up a musical composition for some instrument
( ) 16. Made up a new game and taught it to someone else

Figure 16.7. Portion of a creative activities checklist. (From E. Paul Torrance, Guiding
Creative Talent. Copyright 1962 by Prentice-Hall, Inc. Used by permission.)


Another example of a Things Done checklist is shown in Figure 16.8. This 
is a portion of a 243-item inventory concerned with scientific activities which 
could have been done or participated in by sixth- and seventh-grade pupils. 
This particular checklist was designed to aid in the identification of potential 
scientific and technical talent. It is based on the hypothesis that pupils who 
have done many things of a scientific nature will continue in these activities and
will tend to become scientists. The score on this inventory is simply the number
of items checked. The manual also suggests that a tabulation of the number of
pupils checking each item is useful in planning science-related activities and
in discussing science interests with parents. 
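
Both kinds of summary mentioned above, the individual score and the item-by-item tabulation, amount to simple counting. The sketch below is only an illustration of that arithmetic: the pupil labels, the item numbers, and the function names are hypothetical and are not taken from the inventory or its manual.

```python
from collections import Counter

# Item numbers each pupil checked (hypothetical responses).
responses = {
    "pupil_a": {1, 4, 7, 12},
    "pupil_b": {4, 7},
    "pupil_c": {2, 4, 9, 12, 15},
}


def checklist_score(checked_items):
    """A pupil's score is simply the number of items checked."""
    return len(checked_items)


def item_tabulation(responses):
    """Count how many pupils checked each item."""
    counts = Counter()
    for checked_items in responses.values():
        counts.update(checked_items)
    return counts


scores = {pupil: checklist_score(items) for pupil, items in responses.items()}
tally = item_tabulation(responses)

print(scores)
print(tally.most_common(3))  # the activities reported by the most pupils
```

The scores serve the individual use (identifying pupils with broad experience), while the tally serves the group use (planning activities and discussing interests with parents).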


⁶ J. C. Flanagan and others, Design for a Study of American Youth (Boston: Houghton Mifflin, 1962).

⁷ E. P. Torrance, Guiding Creative Talent (Englewood Cliffs, New Jersey: Prentice-Hall, 1962).




read book on astronomy               used a lightmeter
explored a cave                      held metallic mercury in hand
read book on biology                 tested milk for butterfat
visited a zoo                        made soap
read magazine about science          seen electroplating being done
embedded insects in plastic          tested soap
used a tuning fork                   grown crystals
embedded flowers in plastic          learned the names of geologic eras
seen a radar screen                  collected fossils
taken a square root of a number      grown plants in a sealed jar
used an ammeter                      read the history of a river or lake
seen a star map                      collected seeds
used a voltmeter                     read about heredity
read a star map                      studied plant growth
read a barometer                     made a collection of different kinds of nuts

Figure 16.8. Sample items from Science Background, 1A, Things Done, checklist. (Copyright
1957 by Science Service, Inc., 1719 North Street, N.W., Washington 6, D.C. Used by
permission.)


Use of checklists for identifying creative and scientific talent is still very much 
in the experimental stage. The contribution of such inventories to an appraisal 
of pupil readiness and to curriculum planning, however, is rather self-evident. 
The more we know about the past learning experiences of pupils the better we 
can plan their future experiences. Although answers to such checklists can be 
easily faked, there is little reason for pupils to do so. Emphasizing that the 
results will be used solely for the planning of future learning activities should
be sufficient to elicit honest responses.


Problem Checklists 


The most comprehensive and widely used checklists have been in the area 
of personal-social adjustment. Typically these checklists contain a collection 
of several hundred problems common to children and youth. The pupil reads 
through the list and checks those problems which are of concern to him. The 
marking of the items usually permits the pupil to indicate also which problems 
he considers most serious. Sample items from the Mooney Problem Checklists
are shown in Figure 16.9. There are 210 items in the junior high school forms
and 330 items in the high school and college forms of these checklists. They
cover such areas as health and physical development, school, home and family,
money, work, the future, personal concerns, and various types of personal-social
relations. The original lists of problems were selected from a master list of
over 5,000 items.

A similar type of checklist for elementary school pupils is the SRA Junior
Inventory. This is designed for use in grades 4 to 8. Sample items shown in
Figure 16.10 indicate the nature of the problem statements. The areas covered
are: About Me and My School, About Me and My Home, My Health, About
Myself, Getting Along with Other People, and Things in General. Note that the
boxes following each item make it possible for the pupil to indicate how he
perceives the size of the problem. There are 168 items in this particular
checklist. A high school form, entitled SRA Youth Inventory, is also available. It
contains 296 problem statements and is suitable for grades 7 to 12.

First Step: Read the list slowly, and as you come to a problem which troubles you, underline it.

Being underweight                              56. Frequent headaches
Being overweight                               57. Weak eyes
Not getting enough exercise                    58. Often not hungry for my meals
Getting sick too often                         59. Not eating the right foods
Tiring very easily                             60. Gradually losing weight

Needing to learn how to save money             61. Too few nice clothes
Not knowing how to spend my money wisely       62. Too little money for recreation
Having less money than my friends have         63. Family worried about money
Having to ask parents for money                64. Having to watch every penny I spend
Having no regular allowance (or income)        65. Having to quit school to work

Figure 16.9. Sample items from the Mooney Problem Check Lists. (Reproduced by permission.
Copyright © 1950, The Psychological Corporation, New York, N.Y. All rights
reserved.)

Problem checklists can be used for several purposes. They can help to (1)
identify pupils who are most likely in need of counseling or other personal 
help; (2) identify the most common problems within a group as a basis for 
group guidance and curriculum planning; (3) improve the effectiveness of 
the curricular, extracurricular, and guidance programs of the school in meeting 
the needs of the pupils. If a tabulation of responses indicates that most pupils 
are troubled by ineffective study habits, for example, steps can be taken to 
remedy the situation. 

When used to screen individuals for referral to the school counselor, it is,
of course, necessary for the pupils to provide their identity. If administered
solely as a basis for curriculum planning and for making revisions in the
schools’ services, however, it may be desirable to obtain anonymous responses.
When signatures are used, only those pupils who want help with their personal
problems can be expected to answer with complete frankness. The protection of
anonymity should provide a much more complete and accurate survey of the
problems experienced by all pupils.


Personality Inventories 


The personality or adjustment inventory is similar to the problem checklist. 
Instead of a list of problems to be checked, however, the typical personality 
inventory presents pupils with a series of questions like those used in a psy-
chiatric screening interview. For example, an inventory might include items
like the following:


Do you daydream often? 

Are you frequently depressed? 

Do you have difficulty making friends? 
Do you usually feel tired? 




Responses to such questions are commonly indicated by circling “yes,” “no,” or
“?” (for uncertain). In some instances a forced-choice procedure is used. That
is, the items are paired and the respondent must indicate which of the two
statements is most characteristic of him.


What things do you wish you could do? What things would you like to know more about?
What things worry you, and keep you from being as happy as you would like to be?

In this booklet you will find a list of many interests and problems of young people your age.

The items look like this:

I want to learn how to read better . . . . . . . . □ □ □ ○
I wish I had more "pep" . . . . . . . . . . . . . . □ □ □ ○

As you mark each item in the list, use the three boxes to show the way you really feel about it.

Put an X in the BIG BOX if it is a BIG PROBLEM for you.
Put an X in the MIDDLE-SIZED BOX if it is a MIDDLE-SIZED PROBLEM for you.
Put an X in the LITTLE BOX if it is just a LITTLE PROBLEM for you.
Put an X in the CIRCLE if it is NOT A PROBLEM for you.

Marking the items in this booklet should help you to understand your own interests and problems.

Figure 16.10. Sample items from the SRA Junior Inventory. (Reprinted by permission of
Science Research Associates, Inc., from SRA Junior Inventory, Form S, by H. H. Remmers
and Robert H. Bauernfeind. © 1957, Purdue Research Foundation.)

Personality inventories vary considerably in the type of score provided. Some
provide a single adjustment score; others have separate scores for particular
adjustment areas (e.g., health, social, emotional, and so on) or for specific
personality traits (e.g., self-confidence, sociability, ascendance, and so on).
In general, research has not supported the validity of separate scores for evalu-
ating adjustment by means of inventories. Even the use of the total score for
distinguishing between adjusted and maladjusted individuals has been seri-
ously challenged.⁸

All of the limitations of the self-report technique tend to be accentuated in 
the personality inventory. (1) The replies can be easily faked and the threaten-
ing nature of many of the questions provides motivation for presenting a distorted
picture. Some inventories provide “control keys” to detect faking and others 
reduce it by means of the forced-choice procedure, but faking cannot be entirely 
eliminated. (2) In addition to honesty, accurate responses require good self- 
insight. This is the very characteristic that poorly adjusted individuals are apt 
to lack. They are prone to excessive use of adjustment mechanisms which tend 
to distort their perceptions of themselves and of their relations to others. (3)
The ambiguity of the items is also apt to introduce error into the results.
Questions like “Are you frequently depressed?” do not mean the same thing to
different individuals. Besides applying his own interpretation to the word “depressed,”
a person must also decide what is meant by the word “frequently.” Does this
mean 60 per cent of the time or 80 per cent of the time? A study by Simpson⁹
has shown that words like “frequently” have little common meaning. His results
for several words which are widely used in personality inventories are presented
in Table 16.1.

⁸ A. Ellis, “Recent Research with Personality Inventories,” Journal of Consulting Psychology, 17, 45-49, 1953.

Table 16.1

RANGE OF MEANINGS STUDENTS ATTRIBUTED TO QUANTITATIVE TERMS
COMMONLY USED IN PERSONALITY INVENTORIES*

Portion of Directions: “Simply indicate how many times out of 100 you think the word
indicates an act has happened or is likely to happen.”

Results:

                  25 per cent of students       25 per cent of students
                  thought the term meant        thought the term meant
                  less than this percentage     more than this percentage
Term              of the time                   of the time

Usually                     73                            87
Often                       52                            82
Frequently                  60                            80
Sometimes                   15                            37
Occasionally                14                            33
Seldom                       6                            17

* Adapted from R. H. Simpson, “Stability in Meanings for Quantitative Terms: A Comparison
Over 20 Years,” Quarterly Journal of Speech, 49, 146-151, 1963.
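
The two columns of Table 16.1 mark off the middle half of the students' answers for each term, so the distance between them shows how loosely the term was understood. The short sketch below works out that arithmetic from the table's own figures; the function names and the idea of measuring "overlap" between two terms are illustrative assumptions, not part of Simpson's study.

```python
# (lower figure, upper figure) in "times out of 100," copied from Table 16.1.
# One quarter of the students answered below the lower figure and one quarter
# answered above the upper figure.
quartiles = {
    "Usually": (73, 87),
    "Often": (52, 82),
    "Frequently": (60, 80),
    "Sometimes": (15, 37),
    "Occasionally": (14, 33),
    "Seldom": (6, 17),
}


def spread(term):
    """Width of the band containing the middle half of the answers."""
    low, high = quartiles[term]
    return high - low


def overlap(term_a, term_b):
    """Width of the band shared by the middle halves of two terms (0 if none)."""
    low_a, high_a = quartiles[term_a]
    low_b, high_b = quartiles[term_b]
    return max(0, min(high_a, high_b) - max(low_a, low_b))


for term in quartiles:
    print(f"{term:<13} middle half spans {spread(term)} points")
print("Often / Frequently overlap:", overlap("Often", "Frequently"))
```

Even for the middle half of the students alone, "often" covers a band of 30 points, and the band for "frequently" lies entirely inside it, which is the sense in which such words have little common meaning.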


The limitations of personality inventories are such that their use should be 
severely restricted in school situations. They are probably most useful as a
general screening instrument for identifying pupils who should be studied more 
closely by the school counselor. If scored at all, only the total score should be 


inventories, the interpretation and use of the results should be left to the psy-
chologically trained counselor.


Projective Techniques 


Projective techniques provide another method of evaluating personal-social 
adjustment with which the classroom teacher should be familiar. Since they 
generally require clinical training to administer and interpret, it is not expected 
that a teacher will use them directly. It is more than likely, however, that he
will encounter some clinical reports on pupils which contain interpretations of
projective test results.


⁹ R. H. Simpson, “Stability in Meanings for Quantitative Terms: A Comparison Over
20 Years,” Quarterly Journal of Speech, 49, 146-151, 1963.




In contrast to the highly structured personality inventory, projective tech- 
niques provide almost complete freedom of response. Typically, the individual 
is presented with a series of ambiguous forms or pictures and asked to describe 
what he sees. His responses are then analyzed to determine what content and 
structure he has “projected” into the ambiguous stimuli.

The two most extensively used projective techniques are the Rorschach Inkblot 
Test and the Thematic Apperception Test (TAT). The Rorschach consists of 
ten inkblot figures on cards and the TAT includes a series of thirty pictures, 
of which only twenty are used for any particular age or sex group. These tests 
are typically administered to one individual at a time and a complete record 
is made of the individual’s responses during the testing. Analysis of the results 
requires both systematic scoring and impressionistic interpretation with major 
emphasis on the total personality pattern revealed. Projective techniques are 
used primarily as an aid to a complete clinical study of those individuals who
are experiencing adjustment difficulties. 


Interest Inventories 


There are several informal methods of obtaining information concerning a 
pupil’s interests. Through direct observation, we can note which areas of study
receive his greatest attention, what types of books he reads, and which activities 
he selects in free-choice situations. By means of activity checklists, we can 
obtain reports on the things he has done on his own. Through direct question- 
ing, we can have the pupil tell or write about the things he would most like to 
do. All of these methods provide clues to a pupil’s interests, but they also have 
their shortcomings. Our observations are usually restricted to in-school activi- 
ties, the things done by pupils reflect home environment and opportunity as 
much as interest, and a pupil’s directly expressed interests are limited by his 
general knowledge and his ability to think of specific activities at the time he
is asked. In addition, these informal methods provide no basis for comparing
an individual’s interests with those of persons in other educational or vocational
groups.

Standardized interest inventories overcome many of the limitations of the 
informal methods. Unfortunately, however, these have been designed primarily 
for use in educational and vocational guidance. The development of standardized 
inventories for use in curriculum planning and instruction has been generally 
neglected.

One of the most widely used interest inventories at the high school and college
levels is the Kuder Preference Record-Vocational. This inventory contains
a number of activities arranged in groups of three. The pupil is forced to decide
which one of the three activities he likes most and which one he likes least. An
example of the type and arrangement of items used is shown in Figure 16.11.
The scoring of the inventory provides a profile of interests in ten areas:
Outdoor, Mechanical, Computational, Scientific, Persuasive, Artistic, Literary,
Musical, Social Service, and Clerical. It also has a “verification score” to show
whether the responses were carefully made.




A number of activities are listed in groups of three. Read over the three activities in each group. Decide 
which of the three activities you like most. There are two circles on the same line as this activity. 

Punch a hole with the pin through the left-hand circle following this activity. Then decide which activ- 
ity you like least and punch a hole through the right-hand circle of the two circles following this activity. 


In the examples below, the person answering has indicated for the first group of three activities, that he 
would usually like to visit a museum most, and browse in a library least. In the second group of three 
activities he has indicated he would like to collect autographs most and collect butterflies least. 


EXAMPLES

Put your answers to these questions in column O.

Visit an art gallery . . . . . .
Browse in a library . . . . . .     LEAST
Visit a museum . . . . . . . . .    MOST

Collect autographs . . . . . . .    MOST
Collect coins . . . . . . . . .
Collect butterflies . . . . . .     LEAST


Figure 16.11. Sample items from the Kuder Preference Record-Vocational. (Copyright 
1948 by G. Frederic Kuder. Reprinted by permission of Science Research Associates, Inc.) 


In addition to the Vocational Form, there is a Personal Form which measures
preferences for various personal and social activities and an Occupational Form 
which relates preferences to specific jobs rather than general interest areas. All 
forms of the Kuder Preference Record use the forced-choice technique of respond- 
ing, are simple to administer and score, and are relatively easy to interpret to 
pupils. 

Another interest inventory that is used extensively at the high school and 
college levels is the Strong Vocational Interest Blank. This inventory consists of
400 items, the majority of which are answered by circling L, I, or D (like, 
indifferent, or dislike). The items are grouped into the following eight parts: 
(1) occupations, (2) school subjects, (3) amusements, (4) activities, (5) pecu- 
liarities of people, (6) order of preference of activities, (7) comparison of inter- 
est between two items, and (8) rating of present abilities and characteristics. 
Sample items, for preferences toward occupations, are shown in Figure 16.12. 

The Strong Blank is scored and interpreted in terms of the similarity be- 
tween an individual’s interests and those of persons successfully engaged in 
particular occupations. The men’s form can be scored for 54 occupations and
the women’s form for 31. The blank must be scored with a different key for 
each occupation. Electronic scoring machines have simplified the process of scor-
ing an answer sheet for all occupations. Special group keys also make it possible
to score the blank for clusters of occupational interest, somewhat similar to 
those provided by the Vocational Form of the Kuder. 

Relatively few interest inventories have been developed for use at the ele- 
mentary school level. Typical of the instruments published for use with this 
age group is the inventory entitled What I Like to Do: An Inventory of
Children’s Interests. There are 294 items, like those in Figure 16.13, designed to
measure interests in art, music, social studies, active play, quiet play, manual
arts, home arts, and science. The inventory is recommended for use in grades
4 through 7, and provides norms based on a national sample of pupils. The
manual provides suggestions for using the results in curriculum development,
guidance, and instruction.

Part 1. Occupations. Indicate after each occupation listed below whether you would like that kind of work or not.
Disregard considerations of salary, social standing, future advancement, etc. Consider only whether or not you
would like to do what is involved in the occupation. You are not asked if you would take up the occupation
permanently, but merely whether or not you would enjoy that kind of work, regardless of any necessary skills, abilities,
or training which you may or may not possess.

Draw a circle around L if you like that kind of work
Draw a circle around I if you are indifferent to that kind of work
Draw a circle around D if you dislike that kind of work

Work rapidly. Your first impressions are desired here. Answer all the items. Many of the seemingly trivial and
irrelevant items are very useful in diagnosing your real attitude.

 1  Actor (not movie)
 2  Advertiser
 3  Architect
 4  Army Officer
 5  Artist
 6  Astronomer
 7  Athletic Director
 9  Author of novel
10  Author of technical book
48  Labor Arbitrator
49  Laboratory Technician
50  Landscape Gardener
51  Lawyer, Criminal
    Lawyer, Corporation
    Life Insurance Salesman
    Locomotive Engineer

Figure 16.12. Sample items from the Strong Vocational Interest Blank. (Copyright 1938
by Stanford University. Used by permission.)


Would you like to...

  1. Make pictures with crayons
  2. Carve things out of wood
 31. Take singing lessons
 32. Go to a concert
109. Play tug-of-war
110. Pitch horseshoes
233. Learn how fish take care of their young
234. See pictures of unusual kinds of fish

Figure 16.13. Sample items reprinted from What I Like to Do: An Inventory of Children’s
Interests, by Louis P. Thorpe, Charles E. Meyers, and Marcella R. Sea. (Copyright 1954,
Science Research Associates, Inc.)




As with other self-report techniques, responses to interest inventories can be
easily faked. This is seldom a problem, however, where the emphasis is on
self-understanding, and educational and vocational planning. Pupils are anxious 
to find out about their interests and the inventories consist of items which tend 
to be psychologically nonthreatening. 

The instability of pupils’ interests during elementary and high school years 
is a major reason for using interest inventories with extreme caution at these 
levels. Extensive studies by Strong¹⁰ have shown that interests are not very
stable until approximately age 17. This does not mean that we must wait until 
this age to measure interests, but rather that our interpretations must be highly 
tentative. In one sense the instability of interests among children and adolescents 
is highly encouraging, for it indicates that our efforts to broaden and develop 
interests through school activities have some chance of succeeding. It is mainly 
when we are attempting to predict vocational success that stability poses a 
serious problem. For vocational decisions, we should rely most heavily on 
interest measures obtained during the last two years of high school, and later. 

Another precaution to keep in mind is not to confuse interest scores with 
measures of ability. A strong interest in science, for example, may or may not
be accompanied by the verbal and numerical aptitudes needed to pursue
successfully a course of study or career in science. A scientific interest may be
satisfied by collecting butterflies or by discovering a cure for cancer. Interest
measures merely indicate whether an individual is apt to find satisfaction in a
particular type of activity. Measures of ability determine the level of activity 
at which the individual can expect to function effectively. 


Attitude Scales 


The two chief methods for evaluating pupils’ attitudes are (1) direct observation,
and (2) attitude scales. When attitudes are specifically developed and
evaluated as instructional outcomes, we must rely mainly on observational tech- 
niques. Our procedure here is to describe in behavioral terms the attitudes to 
be evaluated (e.g., concern for others, scientific attitude, and so on) and to 
gather evidence of these changes by means of anecdotal records, rating scales, 
and checklists. Self-report methods are generally infeasible for this purpose 
because responses can be easily faked and because pupils have strong motiva- 
tion for doing so where course grades might be affected. 

Attitude scales are self-report inventories designed to measure the extent to 
which an individual has favorable or unfavorable feelings toward some person, 
group, object, institution, or idea. They are primarily useful where the individual 
has little reason for distorting the results, such as in the development of self- 
understanding or in research. A common research use is in the study of attitude 
change resulting from particular experiences (e.g., reading, motion pictures; 


19 E. K. Strong, Jr., Vocational Interests of Men and Women (Stanford, California: Stan- 
ford University Press, 1943). 




group discussion, and so on). Group results, based on anonymous responses, 
can also be used as an aid in evaluating curricular and extracurricular programs, 
specific educational practices, and teaching effectiveness.

A number of different methods of constructing attitude scales have been 
developed. Three of the most common are those originated by Thurstone, Rem-
mers, and Likert.11

Thurstone’s Method. The procedure developed by Thurstone includes the 
following steps: 

1. A series of statements expressing all ranges of opinion toward some atti-
tude object are written or collected. For example, in preparing a scale to meas-
ure attitude toward school, a large number of items like the following might be
gathered:

School is exciting.

School is sometimes interesting.

School is a waste of time.

A good pool of such items for teacher-made scales can be obtained by having
pupils write a series of statements representing different degrees of the attitude
to be measured.

2. The statements are edited, placed on slips of paper or cards, and sorted
into eleven piles by thirty or more judges. Where it is desired to have high
scores represent favorable attitudes, the judges are instructed to place the
statements expressing the most favorable attitudes in pile 11, those expressing
a neutral position in pile 6, those expressing the least favorable attitude in
pile 1, and the remainder in one of the intervening piles. In constructing teacher-
made scales, any group of teachers or parents may serve as judges, since it is
assumed that this sorting process is not influenced by the attitudes of the judges.
As noted above, the judges are asked merely to classify the statements, not to
indicate their own attitudes.

3. The number of times a statement is placed in each pile provides the basic
data for determining the ambiguity and scale value of the item. Where there is
considerable disagreement in the placement of a statement, it is regarded as too
ambiguous and discarded. The scale value of each of the usable statements is
based on the median position assigned by the judges.

4. The final form of the attitude scale is constructed by selecting those state-
ments which are most relevant, least ambiguous, and which cover the entire
range of scale values. The statements are then arranged in random order, and
the subject is simply told to check those statements with which he agrees. His
score is obtained by averaging the scale values of the statements he has checked.
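
The arithmetic in steps 3 and 4 can be illustrated with a short sketch. It is only an illustration under stated assumptions: the pile placements are invented, eight judges stand in for the thirty or more the procedure calls for, disagreement among judges is measured here by the interquartile range of the placements, and the cutoff `max_spread` is an arbitrary choice rather than a value given in the text. All function names are hypothetical.

```python
from statistics import median, quantiles

# Invented pile placements (1 = least favorable, 11 = most favorable).
# Eight judges are shown for brevity; the procedure calls for thirty or more.
judgments = {
    "School is exciting.": [10, 11, 10, 11, 10, 9, 11, 10],
    "School is sometimes interesting.": [6, 7, 6, 6, 7, 5, 6, 7],
    "School is a waste of time.": [1, 1, 2, 1, 1, 2, 1, 1],
    "School is what you make it.": [2, 9, 5, 11, 3, 8, 6, 10],
}


def scale_value_and_spread(piles):
    """Return (median pile, interquartile range) for one statement."""
    q1, _, q3 = quantiles(piles, n=4)
    return median(piles), q3 - q1


def build_scale(judgments, max_spread=2.0):
    """Map each usable statement to its scale value (median placement).

    Statements whose placements spread more than max_spread are treated
    as ambiguous and dropped. The cutoff is an assumption for this sketch.
    """
    scale = {}
    for statement, piles in judgments.items():
        value, spread = scale_value_and_spread(piles)
        if spread <= max_spread:
            scale[statement] = value
    return scale


def score_respondent(scale, checked):
    """Average the scale values of the statements the subject checked."""
    values = [scale[s] for s in checked if s in scale]
    return sum(values) / len(values) if values else None


scale = build_scale(judgments)
score = score_respondent(
    scale, ["School is exciting.", "School is sometimes interesting."]
)
print(scale)
print(score)
```

With these invented data the fourth statement is discarded because the judges scatter it across most of the piles, and a pupil who checks the first two statements receives the average of their scale values. The sketch does not cover the judgment involved in choosing relevant statements that span the full range of scale values.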

Remmers’ Method. To reduce the amount of work required in
building a separate attitude scale for each attitude object, Remmers
has developed generalized or master attitude scales. These are designed so that


11 A. L. Edwards, Techniques of Attitude Scale Construction (New York: Appleton-
Century-Crofts, 1957).




the same statements can be used to measure attitude toward a series of attitude
objects in the same general area. For example, statements like the following
are included in his scale for measuring attitudes toward any institution.


1. Is perfect in every way. 

2. Is the most admirable of institutions. 

3. Is necessary to the very existence of civilization.
4. Is the most beloved of institutions. 


In responding to the scale, the subject writes in the name of the institution
indicated by the examiner and then checks those statements with which he
agrees.

Remmers’ master scales are constructed and scored in the same manner as
the Thurstone-type scale. The major differences are that the attitude statements
in the master scales are necessarily more generalized and they are arranged
in order of decreasing favorableness, rather than in random order. A number
of master attitude scales have been developed under Remmers’ direction. These
include measures of attitude toward such things as (1) any disciplinary pro-
cedure, (2) any school subject, (3) any teacher, (4) any national or racial
group, (5) any proposed social action, and (6) any vocation.

Likert’s Method. This approach to attitude scale construction is less time-
consuming than the other two methods because it does not require the sorting
of statements by judges. It also differs in that (1) only clearly favorable or
clearly unfavorable attitude statements are used, and (2) the subject is asked
to respond to each statement on a five-point scale: strongly agree (SA), agree
(A), undecided (U), disagree (D), and strongly disagree (SD).

Statements in a Likert-type scale might appear as follows: 

SA A U D SD School is exciting. 

SA A U D SD School is a waste of time. 

In scoring favorable statements, like the first item above, the alternatives are
weighted 5, 4, 3, 2, 1, going from SA to SD. In scoring unfavorable statements,
like the second item above, these weights are reversed. Thus, a pupil circling
SA on both of the above items would receive five points for the first and one
point for the second. An individual’s total score on this type scale is the sum
of his scores on all items, with the higher score indicating a more favorable
attitude.
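
The scoring rule just described can be written out as a brief sketch. The function and variable names are illustrative only; the weights and the reversal for unfavorable statements follow the description above.

```python
# Weights for a favorable statement, going from SA to SD.
WEIGHTS = {"SA": 5, "A": 4, "U": 3, "D": 2, "SD": 1}


def item_score(response, favorable):
    """Score one item; weights are reversed for unfavorable statements."""
    weight = WEIGHTS[response]
    return weight if favorable else 6 - weight


def total_score(responses, keys):
    """Sum the item scores; a higher total means a more favorable attitude."""
    return sum(item_score(responses[s], fav) for s, fav in keys.items())


# True marks a favorable statement, False an unfavorable one.
keys = {
    "School is exciting.": True,
    "School is a waste of time.": False,
}

# A pupil who circles SA on both items.
responses = {
    "School is exciting.": "SA",
    "School is a waste of time.": "SA",
}

print(total_score(responses, keys))  # 5 for the first item plus 1 for the second
```

Subtracting the weight from 6 is simply a compact way of reversing the 5, 4, 3, 2, 1 order on a five-point scale.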

The Likert-type scale provides results which are comparable to those obtained
by the Thurstone and Remmers methods.13 Its greater simplicity of construction
and scoring would seem to favor its use.

A Final Precaution. Attitude scales, like other self-report techniques, pro-
vide verbal expressions of feelings and opinions that individuals are willing to
make known to others. Their effective use requires a good rapport with the 


12 H. H. Remmers, Introduction to Opinion and Attitude Measurement (New York:
Harper & Row, 1954).

13 A. L. Edwards, Techniques of Attitude Scale Construction (New York: Appleton-
Century-Crofts, 1957).




individuals tested and a sincere belief on their part that frank responses are 
in their own best interests. Even under the most ideal conditions, however, it is 
desirable to supplement attitudes determined by self-report methods with evi- 
dence obtained from direct observation. 


SUMMARY 


In some areas of learning and development it is desirable to supplement the 
teacher’s observations with information obtained directly from the pupils. We 
can ask the pupils to rate or judge their peers (their fellow pupils) and to 
report on their own feelings, thoughts, and past behavior. A variety of (1)
peer-appraisal methods, and (2) self-report techniques have been developed
for this purpose.

Peer appraisal is especially useful in evaluating personality characteristics,
social relations skills, and other forms of typical behavior. The give-and-take of
social interaction in the peer group provides pupils with a unique opportunity
to observe and judge the behavior of their fellow pupils. Since these peer ratings
are based on experiences which are seldom fully visible to adult observers, they
provide an important adjunct to other methods of evaluating personal-social
development.

Peer-appraisal methods include the “guess who” technique, the sociometric
technique, and social relations scales. The first of these techniques requires
pupils to name those classmates who best fit each of a series of behavior descrip-
tions. The number of nominations each pupil receives on each characteristic
indicates the reputation he holds among his peers. This nominating procedure
can be used to evaluate any aspect of behavior which is observable to fellow
pupils. The sociometric technique also calls for nominations but here the pupils
are to indicate their choice of companions for some group situation or activity.
The number of choices a pupil receives serves as an indication of his social
acceptance and the network of choices can be used to plot the social structure
of the group. The results can also be used to rearrange groups, to improve the
social adjustment of individual pupils, and to evaluate the influence of school
practices on pupils’ social relations. Published social relations scales are also
available for some of these purposes. They are less flexible than the traditional
sociometric technique but they have the advantage of standardized procedures
of administration and scoring.

Self-report techniques are typically used to obtain information which is in-
accessible by other means. This includes reports on the pupil’s past experiences
and his perceptions of his inner life. Such information can be obtained by
personal interview but a self-report inventory is more commonly used. The
inventory is a sort of standardized written interview which provides comparable
results from one person to another. Effective use of self-report techniques as-
sumes that the respondent is both willing and able to report accurately. Thus
special efforts must be made to meet these conditions.

Activity checklists provide a survey of the pupil’s past experiences which is
useful in assessing learning readiness and in curriculum planning. Problem
checklists, personality inventories and projective techniques aid in evaluating
the personal-social adjustment of pupils. Of these, the problem checklist is the
only one recommended for use by the classroom teacher. Interest inventories
contribute to a better understanding of pupils and are especially useful in edu-
cational and vocational planning. Attitude scales provide an indication of the
feelings and opinions pupils hold toward various groups, institutions, and ideas.

Peer ratings and self-report inventories provide useful information for under-
standing pupils better and for guiding their learning, development, and adjust-
ment. These purposes will be best served, however, when the information is
combined with test results, observational data, and all other available data
concerning the pupils.


SUGGESTIONS FOR FURTHER READING 


Anastasi, Anne. Psychological Testing. 2nd edition, New York: Macmillan, 1961. Chapter
18: “Self-Report Inventories.” Chapter 19: “Measures of Interests and Attitudes.”

Bauernfeind, R. H. Building a School Testing Program. Boston: Houghton Mifflin, 1963.
Chapter 11: “Measuring Vocational Interests.” Chapter 12: “Measuring Students’ Per-
sonality Characteristics.”

Bonney, M. E. “Sociometric Methods,” Encyclopedia of Educational Research. 3rd edition,
New York: Macmillan, 1960. Pages 1319-1324.

Cronbach, L. J. Essentials of Psychological Testing. New York: Harper & Row, 1960. Chap-
ter 14: “Interest Inventories.” Chapter 15: “General Problems in Personality Measure-
ment.”

Sells, S. B., and D. K. Trites. “Attitudes,” Encyclopedia of Educational Research, 3rd edi-
tion, New York: Macmillan, 1960. Pages 102-115.

Super, D. E. “Interests,” Encyclopedia of Educational Research. 3rd edition, New York:
Macmillan, 1960. Pages 728-733.

Super, D. E., and J. O. Crites. Appraising Vocational Fitness by Means of Psychological
Tests. New York: Harper & Row, 1962. Chapters 16-19. Interest and attitude tests.

Thorndike, R. L., and Elizabeth Hagen. Measurement and Evaluation in Psychology and
Education. New York: John Wiley & Sons, 1961. Chapter 12: “Questionnaires and In-
ventories for Self-Appraisal.”


Chapter 17 
improving learning, 
marking, and 
reporting 


The main function of evaluation in teaching is the improvement of
pupil learning. . . . In this chapter we shall discuss some of the ways evalua-
tion can contribute to this end. . . . A closely related matter, the improve-
ment of marking and reporting, is also considered.


Emphasis throughout this book has been on the need to identify all impor- 
tant objectives of instruction, to state these objectives clearly and specifically 
in behavioral terms, and to select or develop the evaluation instruments which 
provide the most valid information for instructional purposes. How much we
depend on standardized tests, teacher-made tests, observational techniques, peer-
appraisal devices, and self-report methods will vary with the area in which we
are teaching and with the age level of the pupils. In areas where performance
skills are the major outcomes (such as music, art, physical education), and
with younger pupils, we shall need to rely more heavily on anecdotal records,
rating scales, checklists, and similar nontest procedures. In other areas, like
social studies and mathematics, both standardized tests and teacher-made tests
are likely to play a much more prominent role. Despite this variation in emphasis
from one situation to another, effective teaching usually requires the use of a
variety of evaluation techniques. This is because objectives for any course are
complex and varied, and because a comprehensive knowledge of pupils is needed 
to effectively guide their learning and development. 


THE ROLE OF EVALUATION IN THE 
IMPROVEMENT OF LEARNING 


The evaluation process can facilitate pupil learning in a number of direct
and indirect ways. Some of these were described in the section on standardized
testing and others were suggested during discussions of various evaluation
procedures. Here, we shall summarize and make more explicit some of the
direct ways that evaluation can contribute to improved learning and instruc-
tion. In general, evaluation can help in (1) clarifying the goals of learning,
(2) understanding the learner, (3) motivating learning, (4) increasing reten-
tion and transfer of learning, and (5) diagnosing and remedying learning
difficulties. We shall consider the specific role of evaluation in each of these
areas.


Clarifying the Goals of Learning 


A systematic approach to evaluation provides for the clarification of learn- 
ing goals at several points: (1) during the planning stage (where the goals are 
defined for evaluation purposes), (2) during instruction (where the goals are 
shared with pupils), and (3) during evaluation (where the instruments provide 
pupils with an operational definition of the goals). 

Ideally, plans for evaluation are made at the same time instructional plans 
are formulated. This increases the likelihood that the desired learning outcomes 
will be clearly defined before instruction begins. Although goals can be devel- 
oped without special attention to evaluation, they are not apt to be as clear and 
definite. Planning for evaluation encourages us to describe in precise terms the 
behaviors we are willing to accept as evidence of learning. 

Goals which have been explicitly defined in behavioral terms are of obvious 
value in selecting instructional materials and methods, and in organizing learn- 
ing activities. They are also useful, however, during the guidance of pupil 
learning. The precise descriptions of behavior make signs of learning progress, 
or lack of progress, more readily apparent during teaching. If we are assisting 
pupils to think more critically, for example, it is helpful to know that critical
thought is represented by such behaviors as “the ability to distinguish fact
from opinion” and “the ability to recognize assumptions underlying conclu-
sions.” These specific behaviors enable us to provide more meaningful learning
experiences and to more readily observe and correct errors in thinking. The
same is true in teaching pupils to speak effectively, write effectively, develop
understandings, learn performance skills, and the like. We are more apt to 
provide proper direction, if we have clearly and explicitly identified the specific 
behaviors that represent successful performance. 

For effective learning, the goals should be clear to the pupils, as well as to 
the teacher. This can be accomplished either by presenting them with statements 
of the goals, or by letting them participate in defining the goals during class- 
room planning. The development of rating scales, checklists, and other evalua- 
tion devices also provides an opportunity to clarify for the pupils the expected 
learning outcomes. The pupils may help develop the instruments or, as a mini-
mum, simply be informed about the specific behaviors measured by the
instruments.




Despite the care with which goals are defined and shared with pupils, they
are apt to have little direct influence on learning unless they are also in harmony
with the evaluation procedures used. Note this warning by Cronbach1:
. . . learning are supposed to be . . . anticipates. Goals not reflected in . . .
some objectives affects marks; the pupil pays only lip service to other objectives
not represented . . .

In the final analysis, then, it is the evaluation procedures that determine the
goals of instruction. If tests require the “application of facts and
principles,” pupils are less likely to limit their study to the memorization of
isolated bits of knowledge. If evaluations of laboratory performance include
“skill in the use of laboratory equipment,” pupils are less apt to
lose sight of this aspect of laboratory work. If the “ability to
communicate effectively” is constantly being observed, pupils will tend to di-
rect their attention to spelling, grammar, pronunciation, and other aspects of
communication. The evaluation procedures indicate to the pupils which
goals are worth working toward and what specific behaviors are needed to
attain these goals. This highlights the importance of evaluating progress toward
all desired learning outcomes. It is only then that we can have any assurance
that the goals of the teacher and the pupils are in accord.

In summary, the evaluation process can aid in clarifying goals for both
teacher and pupil. Defining goals in behavioral terms contributes to better
instructional planning and more effective guidance of learning activities. In
addition, the evaluation procedures provide pupils with an operational defini-
tion of the goals to be achieved.


Understanding the Learner 


One of the most readily apparent contributions evaluation can make to im-
proved teaching and learning is that of providing increased information con-
cerning the learner. From the teacher’s standpoint, this information might be
classified into two general types: (1) that which is obtained from the school-
wide evaluation program, and (2) that obtained from classroom testing and
evaluation. The first type provides general background information which helps
us understand the level and range of abilities of the pupils we are teaching. The
second type provides information specific to the goals of the course. This aids
us to continuously refine and modify our judgments concerning the pupils’
strengths and weaknesses and their readiness for new learning experiences.

If there is a systematic and comprehensive evaluation program in the school,
we can learn a great deal about our pupils before meeting them. A review of
each pupil’s cumulative record should provide information concerning his
scholastic aptitude, record of growth in the basic skills and other areas of

1 L. J. Cronbach, Educational Psychology (New York: Harcourt, Brace & World, 1962),
page 542.


achievement, personal-social development, health, home background, and the 
like. Such information prior to instruction makes it possible to take into account 
the abilities and needs of pupils during course planning. If some pupils are 
noted to be deficient in basic skills, for example, review or remedial work can 
be planned for them. In other instances it might be necessary to modify course 
goals or instructional plans to rectify common areas of weakness in achieve- 
ment. Also, the range of individual differences in aptitude and achievement 
might suggest the need for within-class grouping or for the use of instructional 
materials of several levels of difficulty.

It is frequently desirable to supplement the general information obtained 
from the cumulative records with the results of a pretest given at the beginning 
of the course. Pretests provide up-to-date information concerning the pupils’
readiness for new learning activity and help identify areas of instruction which
should be emphasized or deemphasized. Pretesting usually takes one of two
forms. (1) Giving a test on those basic skills or concepts which are prerequisite
to learning the material in the present course. For example, a teacher of begin-
ning algebra might administer a test of computational skill in arithmetic. (2)
Giving a test which covers the material to be taught in the course. For example,
a teacher of social studies might administer a comprehensive achievement test
at the beginning of the course to determine what the pupils already know. The
results of pretesting can serve as a basis for remedying deficiencies in pre-
requisite skills and for making further modifications in the materials and
methods of instruction.

Periodic testing and evaluation during the course may also reveal clues which 
aid in adjusting instruction to meet pupils’ needs. A test at the end of a “weather
unit,” for instance, may show inadequate knowledge of weather instruments,
thereby suggesting a review, or a trip to the local weather station. Ratings of
laboratory performance in science might reveal errors in procedure that could
be cleared up by a demonstration to the class. Poorly written answers on an essay
test in social studies might indicate the desirability of providing pupils with
more practice in organizing and expressing ideas. A test of computational skill
in mathematics might suggest review for some pupils and more advanced work
for others. Thus, each test and evaluation instrument provides information con-
cerning successes and failures in learning which enables us to take prompt
corrective action and to provide future experiences in closer harmony with the
learning readiness of pupils.

In summary, evaluation data help us understand the abilities and needs of
our pupils, their readiness for new learning experiences, and their progress 
toward the course goals. Information from the school-wide evaluation program 
contributes to more effective precourse planning, and periodic evaluation during 
the course makes it possible to adjust instruction to the progress of the group 
and to the needs of individual pupils. 


Motivating Learning 


There are two major ways that the evaluation process can facilitate pupil
motivation: (1) by providing immediate, attainable goals toward which to
work; (2) by providing knowledge of learning progress.

Working toward remote goals, without the encouragement of intermediate 
consequences, has little meaning for children and adolescents. The teacher who 
attempts to motivate his class with urgings that “this will be of great value in 
adult life” has little chance of success. Pupils need short-term goals to serve 
as guideposts along the way. Tests, ratings, and other evaluation procedures 
serve this purpose. 

Both research and personal experience substantiate the fact that the mere 
expectation that a test will be given tends to stimulate learning activity. There 
is also considerable evidence that the type of test anticipated will influence how 
and what pupils study. Research findings have shown that pupils tend to con-
centrate on trends, relationships, and the organization of material when prepar-
ing for essay tests and on factual details when studying for objective tests.2 In
passing, it should be noted that the objective tests used in these studies empha-
sized knowledge of factual details rather than understandings, the application
of facts and principles, and similar complex learning outcomes. Consequently, 
the results should not be interpreted to mean that objective tests per se encour- 
age the memorization of factual information. These findings are simply in 
harmony with the more general principle that pupils will tend to emphasize 
those learning outcomes which are reflected in the evaluation procedures. 

In addition to arousing and directing learning activity toward definite short- 
term goals, evaluation procedures contribute to motivation by letting pupils 
know how well they are doing. The results from tests, ratings, checklists, and 
other evaluation devices provide continuous feedback to the learner concerning 
his successes and failures. Such feedback enhances learning by reinforcing cor- 
rect responses and by identifying errors that should be eliminated. 

A study by Page3 has shown that the type of feedback is an important element
in motivation. Teachers in 74 classes, in grades seven to twelve, were asked to
administer an objective test to their classes, to place the score and grade on
each paper, and then to randomly assign the papers to one of three groups.
Group 1 papers received no comment beyond the score and grade assigned to
all papers. Group 2 papers received general and encouraging comments, like
“Good work. Keep at it.” Group 3 papers received the specific comment the
teacher thought desirable under the circumstances. A later follow-up with a
second objective test showed that the highest scores were achieved by the pupils
who had received the specific comments (Group 3), the next highest by those
given the general comments (Group 2), and the lowest by those receiving no
comment on their papers (Group 1). The motivating effect of the comments did
not appear to be dependent on the school, grade level, or ability of the pupil.


2 W. W. Cook, “The Functions of Measurement in the Facilitation of Learning,” Educa-
tional Measurement, ed. E. F. Lindquist (Washington, D. C.: American Council on Educa-
tion, 1951).

3 E. B. Page, “Teacher Comments and Student Performance: A Seventy-Four Classroom
Experiment in School Motivation,” Journal of Educational Psychology, 49, 173-181, 1958.




To be most effective, feedback to the learner should be immediate, as well
as specific.4 This principle of immediacy is carried to the ultimate in the teach-
ing machine, which is simply a device for presenting an orderly sequence of
explanations and questions. As the learner responds to each question, he is
informed immediately whether his response is correct or incorrect. Although
we cannot hope to match this feature of the teaching machine in our routine
use of evaluation procedures, there are steps that can be taken to implement the
principle. First, we can return test papers and all other evaluation results to
pupils as soon as possible. Second, we can make specific comments on pupils’
papers so that they will have a clear notion of what they did well and where
they need improvement. Third, we can help pupils develop self-evaluation
skills. When pupils learn the qualities desired in a performance or a product
and obtain experience in judging their own work in terms of these criteria,
they are better able to provide their own immediate feedback. This type of self-
reinforcement and self-correction of errors is basic to learning and an ultimate
objective of all education. Unless pupils develop “built-in” standards of per-
formance they are not apt to do much learning on their own, either during
their school years or after they leave school.

In summary, evaluation procedures contribute to pupil motivation by pro-
viding short-term goals and by providing feedback concerning learning progress.
For maximum results the evaluation procedures should represent all of the
important objectives of the course and the feedback should be specific and
prompt.


Increasing Retention and Transfer of Learning 


Evaluation procedures can contribute to greater retention and transfer of
learning by (1) focusing attention on those learning outcomes that are most
permanent and most widely applicable, and (2) providing practice in applying
previously learned skills and ideas in new situations.

Evidence concerning the relative permanence of various types of learning
outcomes is rather sparse. What there is seems to suggest that retention in-
creases as the outcomes become more complex. The results of a typical study
in this area are presented in Table 17.1. In this study, tests were given to col-
lege students in zoology at the beginning of the course, the end of the course,
and again one year later. Note that for the simplest learning outcome, that of
naming animal structures, less than a fourth of the course gain was retained a
year later. In contrast, all of the gain in the ability to apply principles to new
situations was retained, and the ability to interpret new experiments actually
increased during the year following instruction. Such findings support the view-
point that the permanency of learning will be enhanced by using evaluation pro-
cedures that emphasize the more complex learning outcomes.

We are interested not only in pupils retaining what they have learned, but 
we should also like them to be able to apply their learning to new situations. 


4 G. M. Blair, R. S. Jones, and R. H. Simpson, Educational Psychology (New York: Mac-
millan, 1962).




A study of grammar should contribute to better oral and written expression, 
arithmetic skills should be useful in solving various types of problems, and 
scientific principles should be helpful in interpreting a variety of scientific 
phenomena. This positive transfer of learning is a major objective of education. 
We can expect each new situation the pupil faces to have some element of novelty 
which requires him to use his old learning in a new way. Unless transfer is 
possible, learned material has very limited value.


Table 17.1

THE RELATIVE PERMANENCY OF DIFFERENT LEARNING
OUTCOMES IN ZOOLOGY*

                                            Mean Score
                                    ---------------------------
                                    Beginning    End      One     Percentage of
                                       of        of       Year    Gain Which
Type of Examination Exercise         Course    Course    Later    Was Retained

1. Naming animal structures
   pictured in diagrams                22        62        31
2. Identifying technical terms         20        83        67
3. Recalling information
   A. Structures performing
      functions in type forms          13        39        34          79
   B. Other facts                      21        63        54          79
4. Applying principles to new
   situations                          35        65        65         100
5. Interpreting new experiments        30        57        64         125

* Adapted from R. W. Tyler, Constructing Achievement Tests (Columbus: Ohio State
University, 1934), page 76.
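
The last column of Table 17.1 can be checked with a short calculation. The sketch below assumes that the percentage is the gain still present a year later divided by the gain made during the course; the function name is illustrative. Because it works from the rounded means printed in the table, its results may differ by a point or two from the published percentages.

```python
def percent_gain_retained(beginning, end, later):
    """Percentage of the course gain still present one year later."""
    gain = end - beginning
    if gain == 0:
        raise ValueError("no gain during the course, so retention is undefined")
    return 100 * (later - beginning) / gain


# Mean scores from Table 17.1: (beginning of course, end of course, one year later).
rows = {
    "Naming animal structures": (22, 62, 31),
    "Identifying technical terms": (20, 83, 67),
    "Recalling information: structures": (13, 39, 34),
    "Recalling information: other facts": (21, 63, 54),
    "Applying principles to new situations": (35, 65, 65),
    "Interpreting new experiments": (30, 57, 64),
}

for exercise, (beginning, end, later) in rows.items():
    pct = percent_gain_retained(beginning, end, later)
    print(f"{exercise:40s} {pct:6.1f}")
```

A value above 100 means that the mean score a year later was higher than at the end of the course, as with the interpretation of new experiments.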


Transfer of learning is most apt to occur when (1) the learning outcomes
have wide applicability, (2) the pupils expect transfer to occur, and (3) the
pupils recognize the similarity between the new situation and other situations in
which the learning was applicable. The evaluation process can contribute to each
of these conditions.

In general, learning outcomes which emphasize the understanding of concepts 
and principles, the interpretation of materials, thinking skills, and other com- 
plex behaviors tend to have the widest applicability and therefore the greatest 
transfer value. Giving priority to such outcomes when selecting instructional 
goals and planning evaluation procedures should increase the likelihood that 
transfer will take place. The use of tests and other evaluation instruments which 
specifically require the application of ideas and skills in new situations can also 
facilitate transfer. Evaluation of this type teaches pupils to anticipate transfer 
and to seek out the familiar elements in the new situation. Of course, it also 
provides practice in making the applications. 

Another aspect of evaluation that has direct implication for the transfer of
learning is that of evaluating typical behavior. When we make day-by-day
evaluations of a pupil’s typical use of grammar, spelling, study skills, and the
like, we are letting him know that we expect new learning in these areas to
carry over to, and become a part of, his daily behavior. Helping pupils incorpo-
rate their newly acquired understandings and skills into their normal behavior
patterns adds greater assurance that the new responses will transfer both to
in-school and out-of-school situations.

To summarize, evaluation procedures can facilitate retention and transfer
of learning by reminding pupils that retention and transfer are expected, by
stressing complex learning outcomes, and by providing opportunities to apply
newly learned material.


Diagnosing and Remedying Learning Difficulties 


There are four major steps in the diagnosis and remediation of learning
difficulties: (1) determining which pupils are having learning difficulty, (2)
determining the specific nature of the learning difficulty, (3) determining
the factors causing the learning difficulty, and (4) applying appropriate remedial
procedures. Testing and evaluation can make a significant contribution to each
of these steps.


Determining Who Is Having Difficulty. There are a number of methods
for identifying those pupils who are experiencing learning difficulty. One of the
most common is to compare the results of standardized achievement tests with
the results of a scholastic aptitude test. If a pupil’s level of achievement is lower
than his level of scholastic aptitude, it is assumed that he is not achieving up
to potential and that a search for the difficulty is in order. This procedure is
useful only if a number of cautions are borne in mind. First, if there is to be
a direct comparison of scores, the achievement and scholastic aptitude scores
must be expressed in comparable units. Second, the achievement and scholastic
aptitude tests must be standardized on the same population. Third, the
discrepancies between the achievement and aptitude scores must be relatively large
to offset the possibility of the differences being due to measurement errors alone.
Fourth, it must be recognized that not all underachievers will be detected by
this method because some learning difficulties (e.g., poor reading ability) will
tend to lower the scores on both tests and make it appear that aptitude and
achievement are in agreement. This is especially likely with scholastic aptitude
tests that emphasize school-learned abilities, but it is somewhat applicable to
all group tests of mental ability.
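
The first three cautions can be illustrated with a short sketch in Python. It
assumes that both scores are already expressed as standard scores on the same
scale and from the same norm group, and it uses the standard error of the
difference (the square root of the sum of the two squared standard errors of
measurement) as the yardstick. The pupils, the standard errors, and the cutoff of
two standard errors are all hypothetical values chosen for illustration, not
recommendations from the text.

```python
import math


def standard_error_of_difference(sem_a, sem_b):
    """Standard error of the difference between two scores on the same scale."""
    return math.sqrt(sem_a ** 2 + sem_b ** 2)


def possible_underachiever(aptitude, achievement, sem_aptitude, sem_achievement, k=2.0):
    """True if achievement falls below aptitude by more than k standard errors."""
    se_diff = standard_error_of_difference(sem_aptitude, sem_achievement)
    return (aptitude - achievement) > k * se_diff


# Hypothetical pupils: (aptitude, achievement), both as standard scores
# with a mean of 50 and a standard deviation of 10 in the same norm group.
pupils = {"Pupil A": (62, 58), "Pupil B": (62, 47)}

for name, (aptitude, achievement) in pupils.items():
    flagged = possible_underachiever(aptitude, achievement,
                                     sem_aptitude=3.0, sem_achievement=4.0)
    print(name, "-> follow up" if flagged else "-> difference within measurement error")
```

A check of this kind cannot address the fourth caution: a pupil whose scores are
depressed on both tests will show no discrepancy at all.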

Another common method of determining learning difficulties is by analyzing
profiles based on an achievement test battery. Such profiles (like those
presented in Chapter 14) make it possible to compare a pupil’s achievement in
each area with his general level of achievement. Weakness in a particular skill
or content area might suggest further diagnosis and study, or remedial teaching.
As noted in Chapter 14, the standard error of measurement must be taken into
account when analyzing test profiles, so that chance differences are not
interpreted as being significant.
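
One simple way to respect the standard error of measurement when reading a profile
is to treat each score as a band rather than a point. The sketch below is a rough
illustration in Python, not a procedure taken from the text: the profile, the
single standard error applied to every area, and the use of the pupil's average
score as his "general level" are all assumptions made for the example.

```python
def score_band(score, sem):
    """Range of one standard error on either side of an obtained score."""
    return (score - sem, score + sem)


def bands_overlap(band_a, band_b):
    """True if two (low, high) bands share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


def areas_outside_general_level(profile, sem):
    """Areas whose score band does not overlap the band around the pupil's average."""
    general = sum(profile.values()) / len(profile)
    general_band = score_band(general, sem)
    return [area for area, score in profile.items()
            if not bands_overlap(score_band(score, sem), general_band)]


# Hypothetical profile of standard scores for one pupil.
profile = {"Reading": 55, "Arithmetic": 44, "Language": 52, "Spelling": 51}
print(areas_outside_general_level(profile, sem=3))
```

Areas whose bands overlap the general level are treated here as chance
differences; only the remaining areas would be candidates for further study.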

In some cases it is desirable to analyze a standardized achievement test item
by item and to make a tally of those missed by each pupil. Items which are
missed by a large number of pupils indicate areas where the class as a whole
is doing poorly. This might suggest that either the test has inadequate content
validity, or that changes in curriculum and teaching method are needed. The
errors of each individual pupil can also be studied for clues to his particular
learning difficulties. In some tests, this process is facilitated by grouping the
items and by indicating what each group of items measures. This is illustrated
by the diagnostic form presented in Figure 17.1. A major caution in using such
forms, or any analysis of item responses to determine individual learning
difficulties, pertains to the small number of items representing each area. At best,
such an analysis merely provides clues which must be followed up by further
study and observation.
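
The tally described above can be sketched in a few lines of Python. The pupils,
the items they missed, and the grouping of items by skill are all hypothetical,
as is the rule of flagging an item when half the class or more missed it.

```python
from collections import Counter

# Hypothetical data: item numbers each pupil missed on a 10-item test.
missed = {
    "Ann": [2, 5, 6],
    "Bill": [5, 6, 9],
    "Carl": [5],
    "Dora": [1, 5, 6],
}

# Hypothetical grouping of items by the skill each group measures.
item_groups = {
    "Addition": [1, 2, 3, 4],
    "Subtraction": [5, 6, 7],
    "Fractions": [8, 9, 10],
}

# Class tally: how many pupils missed each item.
class_tally = Counter(item for items in missed.values() for item in items)
widely_missed = sorted(item for item, count in class_tally.items()
                       if count >= len(missed) / 2)


def errors_by_group(pupil):
    """Count one pupil's errors within each group of items."""
    wrong = set(missed[pupil])
    return {group: len(wrong & set(items)) for group, items in item_groups.items()}


print("Missed by half the class or more:", widely_missed)
print("Bill:", errors_by_group("Bill"))
```

With only three or four items in a group, a count of one or two errors is no more
than a clue, which is the caution stated in the paragraph above.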

Informal classroom evaluation procedures can also be used to detect learning
difficulties. The same type of item-by-item analysis used with standardized tests
can be applied to classroom tests to detect group and individual learning errors.
Rating scales, checklists, anecdotal records, and other observational devices also
provide clues concerning learning problems. The day-by-day observations and
judgments of an experienced teacher are especially valuable, because he
frequently can spot a pupil’s difficulty before it becomes serious.

In determining which pupils are having learning difficulty, we should not
confine our efforts to those with problems in the basic skills and the content
areas. Pupils who are having difficulty in social relations, emotional adjustment,
and other aspects of personal-social development also require attention. Learning
problems of this type are significant in their own right and they have a direct
bearing on the pupil’s learning effectiveness in other areas.

Determining the Specific Nature of the Learning Difficulty. The
diagnosis of learning difficulties is a matter of degree. In some instances, the general
procedures for locating pupils with learning difficulties provide sufficient
information for immediate corrective action. In other cases, it may be necessary to
supplement this information by further diagnostic study before planning
remedial work. In still others, the learning problem may be so persistent and
severe that the pupil should be referred to a specialist for intensive diagnosis.

When a pupil’s learning difficulty is in the basic skill area, a logical follow-up
procedure is the administration of a diagnostic test. Such tests are based on
the common errors pupils make and thus provide a systematic method for
locating the specific problem. Diagnostic tests tend to provide a more reliable
analysis of a pupil’s errors than general achievement tests because they have a
larger number of items representing each particular aspect of the skill being
measured. The test manuals and accessory materials accompanying published
diagnostic tests typically provide suggestions for further diagnosis and for use
of the test scores in remedial work.

Since published diagnostic tests are limited almost entirely to the areas of
reading and arithmetic, it is frequently necessary to use more informal methods
for diagnosing a pupil’s learning difficulties. The procedure of analyzing a
pupil’s responses to each test item, described in the previous section, is one
such method.


Figure 17.1. Sample form showing grouping of items for diagnostic analysis:
the Diagnostic Analysis of Learning Difficulties, California Achievement Tests,
Junior High Level Battery. Sections of the form: 1. Reading Vocabulary;
2. Reading Comprehension; 3. Arithmetic Reasoning; 4. Arithmetic Fundamentals;
5. Mechanics of English; 6. Spelling. (Copyright 1957 by California Test Bureau.
Used by permission.)

Another is to administer a general achievement test individually and ask the
pupil to describe aloud the mental process he is following as he answers each
question. This “thinking aloud” provides clues to the pupil’s understanding,
skill, and method of approaching problems. Since the test must be administered
on an individual basis, it is also possible to note any emotional factors or
undesirable habits which might be interfering with the pupil’s performance. Clues
concerning the specific nature of a pupil’s learning difficulties might also be
obtained from his cumulative record. An examination of past test results,
ratings, anecdotal records, and other evaluative data can frequently throw light
on the nature of a pupil’s present difficulty. This, of course, is also a first
step in searching for the causes of the problem.

Determining the Factors Causing the Learning Difficulty. Some learning
difficulties can be attributed to improper teaching methods, unsuitable
curriculum emphasis, or overly complex course materials. Such instances are fairly
easy to identify, because a relatively large number of pupils will experience
the same difficulty. When this occurs, we should, of course, focus our attention
on locating and correcting the shortcomings in our instructional methods and
materials. This is one of the major ways that evaluation results can contribute
to improved instruction.

Of special concern to us here, however, are the persistent learning difficulties
of individual pupils which cannot be accounted for by faulty instruction. To
determine the causes of such problems, we must make a careful study of the
pupil and his environment. The major areas to consider are the pupil’s (1)
scholastic aptitude, (2) reading, arithmetic, and language skills, (3) work-study
skills, (4) health and physical condition, (5) emotional adjustment, and (6)
home environment. Unfavorable factors in any of these areas might cause or
contribute to learning problems.

It should be noted that the causes of learning difficulties are multiple and
complex and seldom can be fully determined by the classroom teacher. However,
a review of the pupil’s cumulative record, special testing and observations (as
needed), an interview with the pupil, and possibly a home visit, should provide
sufficient information on which to base remedial action. If the pupil’s learning
problem requires more extended study than can be accomplished within the
normal teaching situation, the pupil should be referred to a specialist.

Applying Remedial Procedures. There is no set pattern to be followed
in helping pupils overcome learning difficulties. In some instances it may be a
simple matter of review and reteaching. In others, an extensive effort to improve
motivation, correct emotional difficulties, and overcome deficiencies in
work-study skills may be required. The specific remedial procedures used in any
given case will depend on the specific nature of the learning difficulty and the
factors which have caused it.

Testing and evaluation are also useful in remedial programs. The
use of periodic testing during remedial teaching might serve any of the
following functions: (1) clarify to the pupil the specific types of responses that
are expected; (2) provide further diagnostic information about the pupil’s
difficulties and learning needs; (3) give the pupil a feeling of success through the
use of a carefully graded series of test exercises; (4) enhance motivation by
providing short-term goals and immediate knowledge of progress; (5) provide
information concerning the effectiveness of the remedial procedures. Other
evaluation procedures can, of course, also be used to provide feedback concerning
learning progress and the success of the remedial program.

Though the immediate aim of remedial work is the correction of specific
learning difficulties, our interest should not stop there. A careful analysis of
evaluation results during diagnosis and treatment will reveal learning problems
that can be prevented and causal factors that can be modified. The
result of a remedial program should be an improved curriculum and more
effective instructional methods.

In summary, evaluation procedures are useful in all phases of diagnostic and


the general instructional program, attention in dia 
on the specific responses of individual pupils, 


IMPROVING MARKING AND REPORTING 


If instructional objectives have been clearly defined in behavioral terms and
evaluation procedures have been e

at best a very brief report form. The process is a highly subjective one for
which there are relatively few helpful guidelines. This has led to marks and
progress reports which vary widely in composition and meaning.

The greatest confusion exists where an attempt is made to summarize pupil
progress in terms of a single letter grade (e.g., A, B, C, D, or E). Should the
assigned mark represent level of achievement, gain in achievement, or some
combination of the two? Should effort be included, or should high achievers
be given good marks regardless of effort? Should pupils be marked in terms
of their own potential learning ability or in relation to the achievement of their
classmates? There are no simple answers to such questions. Practice varies from
school to school, and frequently from teacher to teacher within the same school
system. Many schools have circumvented such problems by abolishing the
single letter grade and developing more elaborate marking and reporting systems.
We shall describe some of these shortly. But first, let us consider the
functions to be served by marks. This should help identify the qualities desired
in a marking and reporting system.


Functions of Marks and Progress Reports 


School marks and other reports of pupil progress serve a variety of specific
functions in the school. These can best be described in relation to the users of
the reports. These include (1) pupils and parents, (2) teachers and counselors,
and (3) administrators.


Reports to Pupils and Parents. The main reason for reporting to pupils
and parents is to facilitate the learning and development of the pupils.
Consequently the specific functions to be served are somewhat the same as those of
the evaluation program. The reports should (1) clarify the goals of
the school, (2) indicate the pupil’s strengths and weaknesses in learning, (3)
increase understanding of the pupil’s personal-social development, and
(4) contribute to the pupil’s motivation.

From the standpoint of pupil learning, most of the functions are probably
best served by the day-to-day evaluation and feedback during instruction.
However, there also seems to be a need for a periodic summary of progress. Pupils
find it difficult to integrate test scores, ratings, and other evaluation results
into an over-all appraisal of their success in attaining school objectives. The
periodic progress report provides this summary appraisal. In addition to giving
pupils a general picture of how they are doing, such reports also provide them
with a basis for checking the adequacy of their own self-estimates of learning
progress.

Questions are frequently raised about the desirability of using school marks
and progress reports for motivational purposes. As with other evaluation
procedures, it would seem to depend largely on how they are used. If a bad report
is held out as a threat to stimulate pupils to work harder, the consequences are
apt to be undesirable. However, when the reports are viewed as opportunities
to check on learning progress, they are likely to have the same motivational
values as properly applied tests. That is, they provide short-term goals and
knowledge of results. Although the feedback concerning progress is not so
immediate as that obtained from testing, properly prepared reports have the
advantage of providing a more comprehensive and systematic picture of the
pupil’s strengths and weaknesses in learning.

Reports to parents should inform them of the goals of the school and the
progress their children are making toward those goals. This is important from
several viewpoints. First, by knowing what the school is attempting to do,
parents are better able to cooperate with school personnel in promoting the
development of their children. Second, information about the successes,
failures, and problems their children are experiencing in school enables parents to
give them the emotional support and encouragement that is needed. Third,
summary reports of learning progress provide parents with a basis for helping
their children make sound educational and vocational plans.⁵ To serve these
purposes adequately, the reports will need to contain as much information
and detail as we can expect parents to comprehend and use.

⁵ R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and
Education (New York: John Wiley & Sons, 1961).

Reports to Teachers and Counselors. Marks and progress reports
contribute to the instructional and guidance programs of the school by providing
increased information about pupils. Such reports supplement and complement
test scores and other evaluative data in the cumulative records. If a pupil’s past
achievements are known, we can better understand his present strengths and
weaknesses and can better predict the areas in which he is likely to be successful
in the future. The increased information provided by progress reports is
especially useful to teachers when planning instruction, diagnosing learning
difficulties, and coping with special problems of personal-social development.
Counselors use the reports, along with other information, to help pupils develop
increased self-understanding and make more realistic educational and vocational
plans. Many progress reports also provide information which is useful in
counseling pupils who have emotional problems.

The instructional and guidance functions of the school would seem to be best
served by a reporting system that is both comprehensive and diagnostic. To
guide learning effectively, aid in


Marks and progress reports serve a
number of administrative functions. They are used for determining promotion
and graduation, awarding honors, determining athletic eligibility, and for
reporting to other schools and prospective employers. For most administrative
purposes, a single letter grade tends to be preferable, largely because such
marks are compact and can be easily recorded and averaged. With the increased
use of machines for routine clerical work, this advantage will probably assume
even greater importance in the future.

There is little doubt that the convenience of the single mark in administrative
work has been a major factor in retarding the development of more
comprehensive and useful progress reports. This need not be the case, however. When
a new reporting system is being developed, it is possible to retain
letter grades for administrative purposes and to supplement them with the
type of information needed by pupils, parents, teachers, and counselors. At the
high school level, the retention of letter grades is almost mandatory, since most
college admission officers insist on them.


Types of Marking and Reporting Systems 


An effective system of marking and reporting will (1) provide the type of
information that is needed by the users of the reports, and (2) present it in a
clearly understandable form. These seem like relatively simple criteria to meet,
but most reporting systems fall far short. Much of the difficulty is due to the
variety of purposes such reports are expected to serve. As we have seen in the
last section, for some uses we prefer comprehensive and detailed reports, while
for others a single mark may be more desirable. An additional problem arises
from variations in the educational backgrounds of the users of the reports.
Information that is understandable to teachers and counselors may be
confusing to many parents. Most marking and reporting systems represent some
type of compromise between the need for detailed information and the need
for conciseness and simplicity.

Traditional Marking Systems. The traditional method of reporting pupil
progress, which is still in wide use today, consists of assigning a single letter
grade (e.g., A, B, C, D, E) or a single number (e.g., 5, 4, 3, 2, 1) to represent
a pupil’s progress in each subject. This system is concise and convenient but it
has several notable shortcomings. (1) The meaning of such marks is often
unclear because they are a conglomerate of such factors as achievement, effort,
and good behavior. (2) Even where it is possible to limit the mark to
achievement only, interpretation is difficult. A mark of C may mean average work in
all areas, or high performance in some areas and low performance in others.
An over-all summary appraisal in the form of a single mark tells us nothing
about the pupil’s relative success in achieving the various course objectives.
(3) As typically used, letter grades have resulted in an undesirable emphasis on
marks as ends in themselves. Many pupils and parents view them as goals to
be achieved, rather than as means for understanding and improving pupil
development. While this is not entirely the fault of the marking system, the
lack of information provided by a single letter grade probably contributes to
this misuse.
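
The second shortcoming, that the same mark can describe quite different pupils,
is easy to demonstrate. The sketch below is a hypothetical illustration in Python:
the four objectives, the 1-to-5 ratings, and the use of a simple average as the
"single mark" are assumptions made for the example, not a marking procedure
described in the text.

```python
from statistics import mean

# Hypothetical ratings (1 = low, 5 = high) on four course objectives.
objectives = ["knowledge", "understanding", "application", "skills"]
pupil_x = [3, 3, 3, 3]   # average work in all areas
pupil_y = [5, 5, 1, 1]   # high in some areas, low in others


def single_mark(ratings):
    """Collapse the ratings on separate objectives into one over-all figure."""
    return mean(ratings)


# Both pupils receive the same over-all figure.
print("Pupil X:", single_mark(pupil_x), " Pupil Y:", single_mark(pupil_y))

# The objective-by-objective profile shows what the single figure conceals.
for name, x, y in zip(objectives, pupil_x, pupil_y):
    print(f"{name:14} X={x}  Y={y}")
```

The two pupils are indistinguishable once their work is reduced to one figure,
although they clearly need different kinds of help.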


Numerous attempts have been made to improve the traditional marking
system by changing the number and meaning of the symbols used. One common
procedure is to reduce the number of symbols to two or three. Typical reports
of this type use letters such as H (honors), S (satisfactory), and U
(unsatisfactory), or simply S and U. These variations have been generally unsatisfactory
because they provide even less information concerning the pupil’s learning
progress.

Another modification of the traditional system, and one which has proved
useful, is to assign two marks for each subject. With this procedure, one set of
letters represents achievement and the other usually represents effort,
improvement, or growth in terms of potential. The obvious advantage of the dual
marking system is that you can get a more pure measure of achievement and at the
same time obtain information about the extent to which it represents the pupil’s
best work. This system is a step in the right direction, but as typically used it
still lacks information concerning the pupil’s relative success in achieving each
of the main objectives of the course.

Checklists of Objectives. To provide more informative progress reports,
some schools have replaced the traditional marking system with a list of
behavior descriptions to be checked or rated. These reports, which are most
common at the elementary school level, typically include ratings of progress
toward the major goals in each subject matter area. The following statements
for reading and arithmetic illustrate the nature of these reports.


READING

1. Reads with understanding.
2. Works out meaning and use of new words.
3. Reads well to others.
4. Reads independently for pleasure.

ARITHMETIC

1. Uses fundamental processes.
2. Solves problems involving reasoning.
3. Is accurate in work.
4. Works at a satisfactory rate.


The symbols used to rate pupils on each of these specific behavioral goals 
vary considerably. In some schools the traditional A, B, C, D, E lettering system 
is retained, but more commonly there is a shift to fewer symbols, such as 
O (outstanding), S (satisfactory), and N (needs improvement).

The checklist form of reporting has the obvious advantage of providing a 
detailed analysis of the pupil’s strengths and weaknesses, so that constructive 
action can be taken to help him improve his learning. It also provides pupils, 
parents, and others with a frequent reminder of the goals of the school. The 
main difficulties encountered with such reports are in keeping the list of be- 
havioral statements down to a workable number and in stating them in such 
simple and concise terms that they are readily understood by all users of the 
reports. These difficulties are probably best overcome by obtaining the coop- 
eration of parents and pupils during the development of the report form. 

At the high school level, where there is a greater need for a single mark in 
each subject for administrative purposes, a common practice is to retain the 
old system and to supplement it with checklists of objectives. A typical example 
of this type of report is presented in Figure 17.2. Note that this report over- 
comes most of the shortcomings of the traditional marking system. It provides 
a single mark for achievement, a separate mark for effort, and a detailed de- 
scription of the pupil’s progress toward each of the major goals of the school. 
With this particular reporting system, the list of items on the left side of the 
rating scale section is common to all courses and the list on the right pertains 
to social studies only. Each subject-matter area has its own report form which
includes a checklist of objectives peculiar to that particular area of study.

Supplementary Methods of Reporting to Parents. To overcome the 
limited information provided by report cards and to improve cooperation be- 
tween teachers and parents, some schools have turned to the use of (1) informal 
letters, or (2) parent-teacher conferences. The major advantage of both methods 
is the greater flexibility in reporting. The unique strengths, weaknesses, and 
needs of each particular pupil can be stressed and plans for improvement sug- 
gested. Particular points that might cause misunderstanding on an abbreviated 
report form can be described and explained in detail. The completeness of the 
report is limited only by the data available about the pupil and the ingenuity 
of the teacher in organizing and presenting it. The parent-teacher conference 
has the additional advantage of providing parents with an opportunity to ask 
questions, describe the pupil’s home life, and discuss plans for the pupil’s further 
development. 

Figure 17.2. A comprehensive report form that combines dual marking and
checklists of objectives. (Social studies progress report with separate marks for
achievement and effort, a space for comments, and a rating scale: + outstanding,
S satisfactory, N unsatisfactory, O inadequate basis for judgment.)

The usefulness of the informal letter and the parent-teacher conference as
the sole method of reporting pupil progress is limited by several factors. (1)
Comprehensive and thoughtful reports require an excessive amount of time
and skill. (2) Such informal methods do not provide a systematic and
cumulative report of pupil progress toward the objectives of the school. Different
aspects of the pupil’s development are apt to be stressed from one report to
another. (3) Letters and conference reports are difficult to summarize and
record for use by teachers, counselors, and administrators. Such summaries,
even if they are well done, tend to be inadequate for many administrative
purposes. (4) It is seldom possible to obtain the cooperation of all parents. Some
parents will not come for conferences, while others cannot because of work,
illness, or other commitments. The same parents who refuse to attend
conferences are likely to ignore requests for parental reply to letter reports.

The shortcomings of the informal letter and the parent-teacher conference 
seem to restrict their use to that of a supplementary role. They are probably 
most useful in conjunction with a more formal report, like the one illustrated 
in Figure 17.2. Here they can be used to clarify and illuminate specific points 
in the report as well as to elaborate on other aspects of the pupil’s development. 


Principles for Developing an Effective 
Marking and Reporting System 


There is no universally accepted method of marking and reporting. What is 
satisfactory in one school may be unsatisfactory in another. Also, a report 
form which is useful for elementary pupils may be inadequate or infeasible at 
the high school level. Each school system must develop its own marking and 
reporting system, one that fits its particular needs and circumstances. The fol- 
lowing principles provide guidelines for this purpose. 

1. The marking and reporting system should be developed cooperatively by 
parents, pupils, and school personnel. School reports are apt to be most widely 
useful when all users of the reports have some voice in their development. This 
is usually done by organizing a committee consisting of representatives of par- 
ent groups, pupil organizations, elementary and secondary teachers, counselors, 
and administrators. Ideas and suggestions are fed into the committee through 
the representatives and the committee members carry back to their own respec- 
tive groups, for modification and final approval, the tentative plans developed 
by the committee. This cooperative participation not only tends to provide a 
more adequate reporting system but it also increases the likelihood that the 
reports will be fully understood by those for whom they are intended. 

2. The marking and reporting system should be based on a clear statement of 
educational objectives. The same objectives which have guided instruction and 
evaluation should serve as a basis for marking and reporting. Some of these 
will be general school objectives and others will be unique to particular courses 
or areas of study. Nevertheless, when developing a reporting system the primary 
question should be, “How can we best report pupil progress toward these par- 
ticular objectives?” The final report form will be limited and modified by a 
number of practical factors, but the central focus should be on the objectives 


Improving Learning, Marking and Reporting 379 


of the school and the behaviors that represent the achievement of these
objectives.

3. The marking and reporting system should be based on the functions to be
served. A special effort should be made to provide the type of information
needed by the users of the reports. This typically requires a study of the
functions for which the reports are to be used by pupils, parents, teachers,
counselors, and administrators. Although it is seldom possible to meet their
needs equally well, a satisfactory compromise is more likely if these needs
are fully known. In some instances, it may be desirable to use more than one
reporting method. For example, a letter grade, which is easily recorded,
might be used for administrative purposes while a more elaborate report form
is used for parents, teachers, and counselors.
4. The marking and reporting system should be based on adequate evaluation.
Teachers should not be expected to report on aspects of pupil behavior
where evidence is lacking or very unreliable. By the same token, including items
on a report form assumes that an attempt will be made to evaluate the behavior
in as objective a manner as possible. Ratings on such items as critical thinking,
for example, should be the end product of testing and controlled observation,
rather than depend on snap judgments based on hazy recollections of incidental
happenings. Therefore, in planning a marking and reporting system, it is
necessary to take into account the types of evaluation data needed. The items
included in the final report form should be those on which we can expect teachers
to obtain reasonably reliable and valid information.
5. The marking and reporting system should be detailed enough to be diagnostic
and yet compact enough to be practical. For the purposes of guiding the
learning and development of pupils, we should like as comprehensive a picture
of their strengths and weaknesses as possible. The desire for detail must be
balanced by such practical demands as the following: (1) the amount of time
required to prepare and use the reports must be reasonable; (2) the reports
should be clearly understandable to pupils, parents, employers, and school
personnel; (3) the reports should be easily summarized for recording purposes.
As noted earlier, a compromise between comprehensiveness and practicality is
sometimes best obtained by combining two or more reporting methods.

6. Where letter grades are used to represent achievement, they should provide
as pure and uniform a measure of achievement as is possible. For administrative
purposes, it may be desirable to summarize the achievement of pupils in
a single letter grade. Where this is done, the grade should pertain to the pupil’s
level of achievement only. It should not be contaminated by such factors as
effort, attitude, and good behavior. These latter factors may be well worth
reporting on, but should be separately reported. When factors other than
achievement are included in the single letter grade, interpretation becomes
hopelessly confused.

Confining letter grades to achievement, however, does not result in complete
uniformity of meaning. Each teacher has his own notion of what constitutes
achievement. Some may limit it to knowledge of factual information
only, while others include a composite of knowledge, understanding, and skill. 
General agreement among the teaching staff concerning the relative emphasis to 
be given to various aspects of achievement in each area of study will do much 
to reduce this source of variation. 

Greater uniformity in the meaning of letter grades also can be obtained by 
providing general guidelines for the approximate distribution of letter grades 
to be used. It is important, however, that the suggested distribution be flexible 
enough to allow for variations in the caliber of the pupils from one course to 
another, and from one time to another in the same course. One method of pro- 
viding this flexibility is to indicate ranges rather than fixed percentages of 


pupils who should receive each letter grade. Thus, a suggested distribution might 
be stated as follows: 


A—10 to 20 per cent 
B—20 to 30 per cent 
C—40 to 50 per cent 
D—10 to 20 per cent 
E— 0 to 10 per cent 


These particular percentage ranges are presented for illustrative purposes 
only. There is no simple or scientific means of determining what these ranges 
should be for a given situation. The decision must be made by the local school 
staff, taking into account the philosophy of the school, the nature of the pupil 
population, whether ability grouping is used, and the purposes the grades⁶ are to
serve. The important thing is that all staff members have a common under-
standing of the basis for assigning grades and that this basis be made explicit
to the users of the grades.⁷
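As a rough illustration of how a staff might check a class against such
guidelines, the following Python sketch compares the percentage of pupils
receiving each grade with the illustrative ranges listed above. The function
names and the class of thirty pupils are hypothetical, and the ranges are only
the illustrative ones from the text, not recommended values.

```python
# Illustrative percentage ranges per letter grade, copied from the list above.
SUGGESTED_RANGES = {
    "A": (10, 20),
    "B": (20, 30),
    "C": (40, 50),
    "D": (10, 20),
    "E": (0, 10),
}


def grade_percentages(grades):
    """Return the percentage of pupils receiving each letter grade."""
    total = len(grades)
    return {g: 100.0 * grades.count(g) / total for g in SUGGESTED_RANGES}


def outside_suggested_ranges(grades, ranges=SUGGESTED_RANGES):
    """List the grades whose share of the class falls outside its range."""
    pct = grade_percentages(grades)
    return [g for g, (low, high) in ranges.items() if not low <= pct[g] <= high]


# Hypothetical class of 30 pupils (not data from the text).
class_grades = list("A" * 5 + "B" * 8 + "C" * 13 + "D" * 3 + "E" * 1)
print(grade_percentages(class_grades))
print(outside_suggested_ranges(class_grades))  # [] means every grade is in range
```

A grade that is flagged is only a prompt for discussion; as the text stresses,
the ranges must remain flexible from one course to another.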


SUMMARY 


In teaching, the primary function of evaluation is the improvement of learn- 
ing. A secondary function is that of providing a basis for marking and report- 
ing pupil progress. 

The evaluation process can contribute to improved learning in a number of
ways. (1) It can help in clarifying the goals of learning by providing precise
operational definitions of the outcomes to be achieved. (2) It can increase our
understanding of the abilities and needs of our pupils, their readiness for new
learning experiences, and their progress toward the course goals. (3) It can
facilitate pupil motivation by providing short-term goals and knowledge of
results. (4) It can contribute to increased retention and transfer of learning
by focusing on complex learning outcomes and by providing practice in the
application of newly learned concepts and skills. (5) It can help in the
detection, diagnosis, and remediation of learning difficulties.


6 It would be grossly unfair, for instance, to use the same distribution of grades in an
accelerated, average, and slow group. Adjustments must be made for level of performance.

7 R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and
Education (New York: John Wiley & Sons, 1961).

School marks and progress reports provide information which is helpful to
pupils, parents, and school personnel. Pupils find them useful as summary
[...] of learning progress which serve somewhat the same functions as
[...] results. Parents, teachers, and counselors use the information
in guiding learning and development and in helping pupils make realistic future
plans. Administrators use the information in determining promotion, athletic
eligibility, honors, and graduation. The reports also provide a basis for
reporting to other schools and to prospective employers.

The diverse functions that progress reports are expected to serve make it
difficult to find a universally satisfactory reporting method. Some of the
methods that have been tried include (1) the traditional marking system
(e.g., A, B, C, D, E), (2) the dual-marking system, (3) checklists of
objectives, (4) informal letters, and (5) parent-teacher conferences. Each
method has rather severe limitations when used alone. Probably the best
reporting system is one that combines a compact mark for administrative
functions with a more detailed report for teaching and guidance purposes. In
any event, some combination of methods seems most appropriate.

The reporting system used in a particular school should be patterned to fit
its needs and circumstances. This can best be accomplished through the
cooperative action of parents, pupils, and school personnel. Special efforts
should be made to develop a reporting system that is in harmony with the
objectives of the school, the purposes for which the reports are to be used,
and the nature of the evaluation data available. The specific report form
should be as comprehensive and detailed as is practical. If a single letter
grade is used to represent achievement, it should provide as pure and uniform
a measure of achievement as possible. Uniform grading practices require
agreement among staff members concerning the bases on which grades are to be
assigned.





Appendix A 
elementary statistics 


Statistics is concerned with the organization, analysis, and interpretation of
[...] data. As a minimum, a teacher should know those statistical techniques
which enable him to (1) analyze and describe the scores obtained in his own
classroom, (2) understand the statistics used in test manuals and research
reports, and (3) interpret the various types of derived scores used in testing.
Knowing how to make basic statistical computations is probably most directly
useful in the first area, but it should also contribute to a greater
understanding of statistical descriptions of data.

Many teachers shy away from statistics because they think it involves advanced
mathematics. The elementary statistical concepts and skills we shall deal
with here involve just three things.

1. Knowledge of new terms. This you would expect to encounter in any new
area you enter.

2. Knowledge of statistical symbols. This is simply a type of shorthand, where
symbols represent words or brief descriptions (e.g., M = mean, or arithmetic
average).

3. Simple arithmetic skills. Statistical computation requires the use of such
skills as addition, subtraction, squaring, and the like. Although advanced
mathematics would help you understand how the statistical formulas are derived,
the practical application of the formulas can be made without this deeper
understanding.


The statistical measures we shall be concerned with here are:

1. Measures of central tendency (averages).
2. Measures of variability (spread of scores).
3. Measures of relationship (correlation).

The first two measures provide a convenient means of analyzing and describing
a single set of test scores, and the last measure can be used to indicate the
agreement between two sets of test scores obtained for the same pupils.




ANALYZING UNGROUPED SCORES 


When test scores are obtained for a group of pupils they are usually in hap- 
hazard order, as shown in Table A.1. 


Table A.1
SET OF SCORES FOR TWELVE PUPILS

Name            Score      Name            Score
 1. John A.       27        7. Henry J.      30
 2. Bill B.       33        8. Susan K.      20
 3. Mary C.       40        9. Helen M.      33
 4. Betty E.      25       10. Dick N.       28
 5. George F.     28       11. Jim R.        28
 6. Marie G.      36       12. Mike S.       32


Such a set of scores may be analyzed directly, by simply rearranging them 
in order of size. This procedure is satisfactory where the number of scores is 
small (less than 20 or 25). With a large number of scores it is more convenient 
to group the scores into a frequency table before analyzing them. Here we shall 
describe and illustrate statistical analyses of ungrouped scores. In the following 
section the same statistical procedures will be illustrated with grouped scores. 


Simple Ranking 


For some uses, it may be sufficient to arrange a set of scores in order of size
and to assign a rank to each score. This will indicate the relative position of
each score in the group. Ordinarily the largest score is given a rank of 1, the
second largest a rank of 2, and so on, until all scores are ranked. The scores
from Table A.1 have been rearranged in order of size and assigned ranks to
illustrate the procedure. The results are presented in Table A.2.


Table A.2
RANKING TEST SCORES

Score    Rank
 40       1
 36       2
 33       3.5   } Score 33 is tied
 33       3.5   } for Ranks 3 and 4.
 32       5
 30       6
 28       8     }
 28       8     } Score 28 is tied
 28       8     } for Ranks 7, 8, 9.
 27      10
 25      11
 20      12
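The ranking shown in Table A.2, including the averaging of tied ranks, can be
sketched in a few lines of Python. This is only an illustration; the function
name `rank_scores` is an assumption made here and does not come from the text.

```python
def rank_scores(scores):
    """Rank scores from highest (rank 1) to lowest, averaging the ranks of ties."""
    ordered = sorted(scores, reverse=True)
    ranks = {}
    position = 1
    while position <= len(ordered):
        score = ordered[position - 1]
        ties = ordered.count(score)
        # Tied scores share the average of the positions they occupy,
        # e.g. positions 3 and 4 give a rank of 3.5.
        ranks[score] = (position + (position + ties - 1)) / 2
        position += ties
    return [(score, ranks[score]) for score in ordered]


scores = [27, 33, 40, 25, 28, 36, 30, 20, 33, 28, 28, 32]  # Table A.1
for score, rank in rank_scores(scores):
    print(score, rank)
```

When the lowest score is not tied, its rank equals the number of scores, which
gives a quick check on the result.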


Special problems arise in ranking when two or more scores are tied for the
same rank. Note in Table A.2, for example, that the score 33 appears twice and
fills the positions ordinarily assigned to rank 3 and rank 4. Since the scores
are identical, the ranks are averaged and each score is assigned a rank of 3.5.
The next score (32) is then assigned a rank of 5 because the third and fourth
positions are occupied. The same procedure of averaging is applied to the
three scores of 28. They are tied for ranks 7, 8, and 9, so each is assigned the
average rank of 8, and the next lowest score is ranked 10. Note that the rank
of the lowest score (12) equals the total number of scores (N = 12) being
ranked. This provides a good means of checking whether the scores have been
ranked correctly.


Measures of Central Tendency 


A measure of central tendency is simply an average or typical value in a
set of scores. We are all familiar with the “arithmetic average” which is
obtained by adding all of the scores in a set and dividing this sum by the
number of scores. In statistics this type of average is called the mean, and is
represented by the letter M (or X̄). Two other commonly used measures of central
tendency are the median (represented by Mdn or P₅₀) and the mode. The median is
the midpoint of a set of scores, that is, the point on either side of which half
the scores occur. The mode (“fashion”) is the score which occurs most frequently.
Since the mean, median, and mode are different types of averages, the word
average should be avoided when describing data. Preciseness requires that
the specific type of average be indicated.

The method of determining each measure of central tendency will be briefly
described below and illustrated in Table A.3.

The Mean (M or X̄). The mean, or arithmetic average, is the most widely
used measure of central tendency. Since it is determined by adding a series of
scores and then dividing this sum by the number of scores, the computation
from ungrouped data can be represented by the following formula:

$$M = \frac{\text{Sum of all scores}}{\text{Number of scores}} \qquad M = \frac{\sum X}{N}$$

in which

Σ = the sum of
X = any score
N = number of scores

Applying this formula to the scores in Table A.3, a mean of 30 is obtained.
Note that the mean takes into account the value of each score. One extremely
high or low score could have an appreciable effect on the mean.

The Median (Mdn or P₅₀). The median is a “counting average.” It is
determined by arranging the scores in order of size and counting up to (or down
to) the midpoint of the set of scores. If the number of scores is even (as in
Table A.3) the median is halfway between the two middlemost scores. When
the number of scores is odd, the median is the middle score.




Table A.3
MEASURES OF CENTRAL TENDENCY

                  Score
                   (X)
                    40
                    36
   50% of           33
   scores           33
                    32
                    30
   Median = 29 →
                    28      Mode = 28
   50% of           28
   scores           28
                    27
                    25
                    20

               ΣX = 360
                N = 12
                M = 30


It should be noted that the median is a point which divides a set of scores into 
equal halves. The same number of scores fall above the median as below the 
median, regardless of the size of the individual scores. Since it is a counting
average, an extremely high or low score will not affect its value. 

The Mode. The mode is simply the most frequent or popular score in the 
set, and is determined by inspection. In Table A.3 the mode is 28 since the larg- 
est number of persons made that score. The mode is the least reliable type of 
statistical average and is frequently used merely as a preliminary estimate of
central tendency. A set of scores sometimes has two modes, and is called bimodal. 
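The three measures can be checked against Table A.3 with a short Python
sketch. The function names are assumptions made for this illustration; the
scores are those of Table A.1. The last lines replace the top score of 40 with
a hypothetical score of 100 to show that an extreme score moves the mean but
leaves the median unchanged.

```python
from collections import Counter


def mean(scores):
    """Arithmetic average: the sum of all scores divided by their number."""
    return sum(scores) / len(scores)


def median(scores):
    """Counting average: the midpoint of the scores arranged in order of size."""
    ordered = sorted(scores)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd N: the middle score
    # even N: halfway between the two middlemost scores
    return (ordered[middle - 1] + ordered[middle]) / 2


def modes(scores):
    """Most frequent score(s); two values would indicate a bimodal set."""
    counts = Counter(scores)
    highest = max(counts.values())
    return sorted(s for s, c in counts.items() if c == highest)


scores = [27, 33, 40, 25, 28, 36, 30, 20, 33, 28, 28, 32]  # Table A.1
print(mean(scores), median(scores), modes(scores))  # 30.0 29.0 [28]

# Hypothetical change: the top score of 40 becomes 100.
extreme = [100 if s == 40 else s for s in scores]
print(mean(extreme), median(extreme))  # 35.0 29.0
```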


Measures of Variability 


A set of scores can be more adequately described if we know how much they
spread out above and below the measure of central tendency. For example, we
might have two groups of pupils with a mean IQ of 110, but in one group the
span of IQs is from 100 to 120 and in the other the span is from 80 to 140.
These would represent quite different ability groups. We can identify such
differences by numbers which indicate how much scores spread out in a group.
These are called measures of variability, or dispersion. The three most com- 
monly used measures of variability are the range, the quartile deviation, and the 
standard deviation. 

The Range. The simplest and crudest measure of variability is the range.
This is obtained by subtracting the lowest score from the highest score. In the
example given above, the range of IQs in the first group is 20 points, in the
second, 60 points. The range provides a quick estimate of variability but it is
undependable, because it is based on the position of the two extreme scores. The
addition or subtraction of a single end-score can change the range significantly.




In the above example, the ranges of the two groups would become equal, if we 
added to the first group a pupil with an IQ of 80 and another with an IQ of 
140. It is obvious that a more stable measure of variability would be desirable. 
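The instability of the range is easy to demonstrate. In the Python sketch below
the individual IQs are hypothetical; only the two spans (100 to 120 and 80 to
140) and the common mean of 110 come from the example above.

```python
def score_range(scores):
    """Highest score minus lowest score."""
    return max(scores) - min(scores)


# Hypothetical pupils chosen so that both groups have a mean IQ of 110.
group_one = [100, 105, 110, 115, 120]
group_two = [80, 95, 110, 125, 140]

print(score_range(group_one))              # 20
print(score_range(group_two))              # 60
# Adding two extreme pupils to the first group makes the ranges equal.
print(score_range(group_one + [80, 140]))  # 60
```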

The Quartile Deviation (Q). The quartile deviation is based on the range
of the middle 50 per cent of the scores, instead of the range of the entire set. The
range of the middle 50 per cent of a set of scores is called the interquartile
range and the quartile deviation is simply half of this range. The quartile
deviation is also called the semi-interquartile range.

The middle 50 per cent of the scores is bounded by the 75th percentile and
the 25th percentile. These points are called quartiles and are indicated by Q₃ and
Q₁ respectively. Quartiles are merely points which divide a set of scores into
quarters. The middle quartile, or Q₂, is the median.

To compute the quartile deviation, we simply determine the values of Q₃ and
Q₁ and apply the following formula:

$$Q = \frac{Q_3 - Q_1}{2}$$

We use the same counting procedure to locate Q₃ and Q₁ that we used to find
the median. With the scores arranged in order of size, we start from the lowest
score, and count off 25 per cent of the scores to locate Q₁ and 75 per cent of
the scores to locate Q₃. This has been done in Table A.4. Note that Q₁ is 27.5
because it falls halfway between 27 and 28, and that Q₃ is 33 because the two
scores it falls between are both 33. Note also that when the median (or Q₂) is
indicated, the set of scores is divided into quarters.

While quartiles are points on the scale (like averages and percentiles), the
quartile deviation represents a distance on the scale. It indicates the distance we
need to go above and below the median to include approximately the middle
50 per cent of the scores.

The Standard Deviation (SD, s, or σ). The most useful measure of variability,
or spread of scores, is the standard deviation. The computation of the
standard deviation does not make its meaning readily apparent, but essentially
it is an average of the degree to which a set of scores deviates from the mean.
Since it takes into account the amount that each score deviates from the mean,
it provides a more stable measure of variability than the range or quartile
deviation.

The procedure for computing the SD is illustrated in Table A.4, and it includes
the following steps. (1) Subtract the mean from each score to obtain
the deviations (x) of the scores from the mean. (2) Square each of the deviations
to obtain x² (note that minus signs are eliminated by squaring). (3) Add
these squares to obtain Σx². (4) Divide Σx² by the number of scores (N) and
take the square root.¹ The computation of the standard deviation can be
expressed by the following simple formula:

$$SD = \sqrt{\frac{\sum x^2}{N}}$$


1 Appendix B contains a table of square roots which may be used for this purpose.




Table A.4
MEASURES OF VARIABILITY

                      Score    Deviation    Dev. Squared
                       (X)        (x)           (x²)
                        40        +10            100
                        36        + 6             36
                        33        + 3              9
   Q₃ = 33 →
                        33        + 3              9
                        32        + 2              4
                        30          0              0
   Mdn (Q₂) = 29 →
                        28        − 2              4
                        28        − 2              4
                        28        − 2              4
   Q₁ = 27.5 →
                        27        − 3              9
                        25        − 5             25
                        20        −10            100

                    M = 30                  Σx² = 304

   Q = (Q₃ − Q₁)/2            SD = √(Σx²/N)
   Q = (33 − 27.5)/2          SD = √(304/12)
   Q = 5.5/2                  SD = √25.33
   Q = 2.75                   SD = 5.03
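The computations in Table A.4 can be reproduced with the following Python
sketch. The function names are assumptions made for illustration. The rule used
in `count_off` when the count does not fall exactly between two scores (take the
score that straddles the point) is also an assumption; the text only illustrates
the case where it does.

```python
from fractions import Fraction
from math import sqrt


def count_off(scores, fraction):
    """Point below which the given fraction of the ordered scores falls."""
    ordered = sorted(scores)
    k = fraction * len(ordered)  # number of scores to count off from the low end
    if k.denominator == 1:
        k = int(k)
        # The point lies halfway between the k-th and (k+1)-th scores.
        return (ordered[k - 1] + ordered[k]) / 2
    # Assumed rule: otherwise use the score that straddles the point.
    return ordered[int(k)]


def quartile_deviation(scores):
    """Half the distance between Q3 and Q1."""
    q1 = count_off(scores, Fraction(1, 4))
    q3 = count_off(scores, Fraction(3, 4))
    return (q3 - q1) / 2


def standard_deviation(scores):
    """Square root of the mean squared deviation from the mean (divisor N)."""
    m = sum(scores) / len(scores)
    return sqrt(sum((x - m) ** 2 for x in scores) / len(scores))


scores = [27, 33, 40, 25, 28, 36, 30, 20, 33, 28, 28, 32]  # Table A.1
print(count_off(scores, Fraction(1, 4)))       # 27.5
print(count_off(scores, Fraction(3, 4)))       # 33.0
print(quartile_deviation(scores))              # 2.75
print(round(standard_deviation(scores), 2))    # 5.03
```

Note that the divisor here is N, as in the formula above, not N − 1.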


The standard deviation, like other measures of variability, represents a
distance on the scale. In a normal distribution, it is equivalent to the distance
in score points that we need to go above and below the mean to include
approximately 68 per cent of the scores (call it two thirds). For other
interpretations of the standard deviation, and illustrations of its use in
computing standard scores, see Chapter 14.

Which Measure of Dispersion to Use. The quartile deviation is used with
the median and is satisfactory for analyzing a small number of scores. Since
these statistics are obtained by counting, and thus are not affected by the value
of each score, they are especially useful where one or more scores deviate
markedly from the others in the set.

The standard deviation is used with the mean. It provides the most reliable
measure of variability and is especially useful in testing. In addition to describ-
ing the spread of scores in a group, it serves as a basis for computing standard
scores, the standard error of measurement, and other statistics used in analyzing
and interpreting test scores.


COMPUTING STATISTICS FROM GROUPED DATA 


In the previous section we described how to compute statistics from ungrouped 
data. Such procedures are especially helpful in clarifying the meaning of statis- 
tical measures and they are useful where the number of scores to be analyzed 
is small. When working with twenty-five or more cases, however, it is usually
desirable to group the scores into a frequency distribution before making
statistical computations.


Grouping Scores into a Frequency Distribution 


A frequency distribution is simply a method of organizing test scores to
simplify statistical analysis. An example of a frequency distribution is shown in
Table A.5. Note that the scores have been grouped by class intervals, the number
of scores falling in each interval has been tallied, and the tallies have been 
counted to obtain the frequency, or number of scores, in each interval. Thus, 
there is one score in the interval 18-20, two scores in the interval 21-23, and 
so on. The total number of scores (N) is the sum of the numbers in the fre- 
quency column. In the finished table, the tally column is usually omitted. 


Table A.5
FREQUENCY DISTRIBUTION OF FORTY TEST SCORES

Class Interval    Tally          Frequency
    48-50         /                  1
    45-47         /                  1
    42-44         //                 2
    39-41         /////              5
    36-38         ///// /            6
    33-35         ///// ///          8
    30-32         ///// //           7
    27-29         ////               4
    24-26         ///                3
    21-23         //                 2
    18-20         /                  1

                                N = 40

The frequency distribution in Table A.5 illustrates a number of points that 
should be observed during construction. 


1. A satisfactory number of class intervals is usually somewhere between 10
and 20.

2. The most convenient size to use for the class interval can be determined
by dividing the total score range by 15 and taking the nearest odd number. An
odd number is preferred, so that the midpoint of each interval will be a whole
number. For example, the range of scores tabulated in Table A.5 was 49 − 18
= 31. Dividing 31 by 15, we obtain 2.07. The nearest odd number is 3, so that
was selected as the size of the interval. Note that the midpoint of the lowest
interval is 19, the midpoint of the next highest interval is 22, and so on. The
midpoint of any interval can be determined by adding the score limits of the
interval and dividing by 2.

3. All intervals in the same table should be of equal size.

4. The score limits of the intervals should not overlap (e.g., 18-20, 21-23,
and so on).



5. The intervals should be arranged in order, with the highest values at the
top of the table.

6. The lower score limit of any interval should be a multiple of the size of
the interval (e.g., 3 × 6 = 18, 3 × 7 = 21, and so on).

7. To determine the size of the interval that has been used in a frequency 
distribution, simply subtract the lower score limit of one interval from the lower 
score limit of the interval just above it (e.g., 21 — 18 = 3, 24 — 21 = 3, and 
so on). This also provides a good check on the accuracy of the class intervals 
in a newly constructed frequency table. 


Some of the above suggestions may be modified to fit special needs, but they 
provide useful guidelines for the beginner. 
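The guidelines above can be followed mechanically, as in this Python sketch.
It assumes whole-number scores; the function names and the short list of scores
are hypothetical and serve only to exercise the rules (the forty scores behind
Table A.5 are not listed in the text).

```python
def interval_size(lowest, highest, target_intervals=15):
    """Nearest odd number to the score range divided by about 15 intervals."""
    quotient = (highest - lowest) / target_intervals
    return max(1, 2 * round((quotient - 1) / 2) + 1)


def frequency_distribution(scores):
    """Return (lower limit, upper limit, frequency) rows, highest interval first."""
    size = interval_size(min(scores), max(scores))
    # The lowest score limit is a multiple of the interval size.
    lower = (min(scores) // size) * size
    table = []
    while lower <= max(scores):
        upper = lower + size - 1  # score limits do not overlap
        frequency = sum(1 for s in scores if lower <= s <= upper)
        table.append((lower, upper, frequency))
        lower += size
    return list(reversed(table))


print(interval_size(18, 49))  # 31 / 15 = 2.07, nearest odd number is 3

# Hypothetical scores with the same lowest and highest values as Table A.5.
sample = [18, 22, 22, 31, 34, 34, 49]
for lower, upper, frequency in frequency_distribution(sample):
    print(f"{lower}-{upper}", frequency)
```

With a lowest score of 18 and a highest score of 49 this produces the same
eleven class intervals as Table A.5, from 48-50 down to 18-20.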

The Limits of a Class Interval. In a frequency distribution, the interval 
limits are written as scores. For example, the bottom interval in Table A.5 
is written 18-20. A test score, however, represents the midpoint of a distance 
extending half a unit below and half a unit above the given score value. Thus, a 
score of 18 extends from 17.5 to 18.5, and a score of 20 from 19.5 to 20.5. 
Therefore, the actual or real limits of the score interval 18-20 extend from 17.5 
to 20.5; the real limits of the next highest interval extend from 20.5 to 23.5, 
and so on. The score limits are written for convenience but the real limits must 
be used in certain statistical computations, as we shall see shortly. 
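A minimal Python sketch of the distinction between score limits and real
limits follows. The function names are assumptions for illustration, and the
half-unit adjustment assumes whole-number scores as in the example above.

```python
def real_limits(lower, upper):
    """Real limits extend half a unit beyond the written score limits."""
    return lower - 0.5, upper + 0.5


def midpoint(lower, upper):
    """Midpoint of an interval: the sum of its score limits divided by 2."""
    return (lower + upper) / 2


print(real_limits(18, 20))  # (17.5, 20.5)
print(real_limits(21, 23))  # (20.5, 23.5)
print(midpoint(18, 20))     # 19.0
```

The upper real limit of one interval is the lower real limit of the next, so
the real limits leave no gaps along the score scale.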


Graphic Presentations of Frequency Distributions 


A frequency distribution presents test data in a clear, effective manner, and
it is satisfactory for most classroom purposes. However, if we desire to study
the distribution of scores more carefully, or to report the results to others, a
graphic representation may be more useful. The two most commonly used graphs
are the histogram (or bar graph) and the frequency polygon (or line graph).
Both graphs are presented in Figure A.1, based on the data in Table A.5. The
scores are shown along the baseline, or horizontal axis, and are grouped into
the same class intervals used in Table A.5. The vertical axis, to the left of the
graphs, indicates the number of pupils earning each score. Thus it corresponds
to the frequency column in Table A.5.

Note that the histogram presents the data in the form of rectangular columns.
The base of each column is the width of the class interval and the height of the
column indicates the frequency, or number of pupils falling within that interval.
It is as if each pupil earning a score within a given class interval were standing
on the shoulders of the pupil beneath him, to form a “human column.”

The frequency polygon is constructed by plotting a point at the midpoint of
each class interval at a height corresponding to the number of pupils, or fre-
quency, within that interval, and then joining these points with straight lines.

As can be seen in Figure A.1, the frequency polygon and histogram are simply
different ways of presenting the same data. In actual practice we would, of
course, use only one of the graphs; the choice being somewhat arbitrary.
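Both graphs can be derived from the same frequency table, as this Python sketch
suggests. The function names are assumptions for illustration, the bars are
drawn sideways with text characters rather than plotted, and the frequencies
are those read from Table A.5 (they sum to N = 40).

```python
# (lower score limit, upper score limit, frequency), lowest interval first.
TABLE_A5 = [
    (18, 20, 1), (21, 23, 2), (24, 26, 3), (27, 29, 4), (30, 32, 7),
    (33, 35, 8), (36, 38, 6), (39, 41, 5), (42, 44, 2), (45, 47, 1),
    (48, 50, 1),
]


def polygon_points(table):
    """Points of the frequency polygon: (interval midpoint, frequency)."""
    return [((lower + upper) / 2, f) for lower, upper, f in table]


def text_histogram(table):
    """One bar per class interval; bar length equals the frequency."""
    return [f"{lower}-{upper} " + "#" * f for lower, upper, f in table]


for line in text_histogram(TABLE_A5):
    print(line)
print(polygon_points(TABLE_A5)[:3])  # [(19.0, 1), (22.0, 2), (25.0, 3)]
```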


[Graph: histogram (solid line) and frequency polygon (broken line). Horizontal
axis: Scores, in class intervals from 18-20 to 48-50. Vertical axis: Number of
Pupils.]

Figure A.1. Histogram and frequency polygon. (Plotted from data in Table A.5.)


Computing the Median and Quartile Deviation


In an earlier section we noted that the median (Mdn) is the 50 per cent point
(the point that divides a set of scores exactly in half); that the quartile
deviation (Q) is one half the distance between Q₁ (25th percentile) and Q₃ (75th
percentile); and that both statistical measures are based on counting. That is,
starting from the low end of a set of scores, we count off one fourth of the scores
to locate Q₁, one half of the scores to locate the median (Q₂), and three fourths
of the scores to locate Q₃. This same counting procedure is used when computing
the median and quartile deviation from a frequency distribution but some ad-
justments are necessary because the scores are grouped into intervals. The basic
steps involve counting up to the interval containing the quartile in which we
are interested (Q₁, Mdn, or Q₃) and then determining what proportion of the
interval must be added to the lower limit of the interval to locate the exact
quartile point. The step-by-step procedures for computing the median and
quartile deviation are listed below and illustrated in Table A.6.


To compute the median (Mdn):

1. Divide the total number (N) of scores by 2 (40 ÷ 2 = 20).

2. Starting at the low end of the frequency column, add the scores in each
interval until the interval containing the median is reached (1 + 2 + 3 + 4
+ 7 = 17).

3. Subtract the sum (S) obtained in Step 2 from the number required to
reach the median (20 − 17 = 3).

4. Find the proportion of the median interval that is to be added to its lower
limit by dividing the number obtained in Step 3 by the number of scores in
the median interval and multiplying this by the size of the class interval (3/8 × 3
= 1.13).

5. Add the amount obtained in Step 4 to the real lower limit of the median
interval. This sum is the median (32.5 + 1.13 = 33.63).


These steps can be expressed by the following formula:

\[ Mdn = L + \left( \frac{N/2 - S}{f} \times i \right) \]

in which
L = real lower limit of median interval
N = the total number of scores
S = sum of scores in intervals below L
f = frequency, or number of scores, in median interval
i = size of the class interval
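
As a check on the arithmetic, the five steps can be sketched in Python. The function name `grouped_median` and its argument layout are assumptions made for this illustration; the frequencies are those of Table A.6, entered from the lowest interval upward.

```python
def grouped_median(lower_limits, frequencies, interval_size):
    """Median of a frequency distribution, following Steps 1 to 5.

    lower_limits -- real lower limit of each class interval, lowest first
    frequencies  -- number of scores in each interval, in the same order
    """
    n = sum(frequencies)
    target = n / 2                       # Step 1: N / 2
    below = 0                            # S: scores counted so far
    for lower, f in zip(lower_limits, frequencies):
        if below + f > target:           # Step 2: median interval reached
            needed = target - below      # Step 3: N/2 - S
            return lower + needed * interval_size / f   # Steps 4 and 5
        below += f
    raise ValueError("frequency distribution is empty")


# Table A.6, lowest interval (18-20) first; real lower limits 17.5, 20.5, ...
frequencies = [1, 2, 3, 4, 7, 8, 6, 5, 2, 1, 1]
lower_limits = [17.5 + 3 * k for k in range(len(frequencies))]

median = grouped_median(lower_limits, frequencies, interval_size=3)
print(median)  # 33.625, which the text reports as 33.63
```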


The steps used to compute the median can also be used to locate Q1 and Q3, by
simply modifying Step 1 to fit the percentage of scores being counted off, and
by substituting the word quartile for median wherever it appears in the
procedural steps. The formulas for Q1 and Q3, presented below, make clear the
changes in Step 1. Note that N/2 used in the formula for the median has been
changed to N/4 in the Q1 formula and 3N/4 in the Q3 formula. The formulas,
and procedures, are alike in all other respects.

To compute the quartile deviation (Q):

1. Find the value of Q1 and Q3 by applying the following formulas:

\[ Q_1 = L + \left( \frac{N/4 - S}{f} \times i \right) \qquad Q_3 = L + \left( \frac{3N/4 - S}{f} \times i \right) \]

The letters in these formulas have the same meaning as those used in the formula
for the median but, of course, the letters L and f now refer to the interval in
which the particular quartile falls. For the scores in Table A.6, Q1 = 29.5 and
Q3 = 38.0.

2. Subtract Q1 from Q3 and divide by 2. This is expressed in the following
formula for the quartile deviation.

\[ Q = \frac{Q_3 - Q_1}{2} \]

The computation in Table A.6 results in a Q of 4.25.
Note that we can now describe the distribution of scores in Table A.6 by
stating that it has a median of 33.63 and a quartile deviation of 4.25. If we
remember that the median is the midpoint and that one quartile deviation above
and below the median includes approximately the middle 50 per cent of the
scores (exactly in a normal distribution), this brief statistical description
provides a meaningful substitute for the entire distribution of scores.

Table A.6

COMPUTATION OF THE MEDIAN AND QUARTILE DEVIATION
FROM A FREQUENCY DISTRIBUTION

Class Interval    Frequency
48-50              1
45-47              1
42-44              2
39-41              5
36-38              6    (Q3)
33-35              8    (Mdn)
30-32              7    (Q1)
27-29              4
24-26              3
21-23              2
18-20              1
                N = 40

\[ Mdn = 32.5 + \left( \frac{\frac{40}{2} - 17}{8} \times 3 \right) = 32.5 + 1.13 = 33.63 \]

\[ Q_1 = 29.5 + \left( \frac{\frac{40}{4} - 10}{7} \times 3 \right) = 29.5 + 0 = 29.5 \]

\[ Q_3 = 35.5 + \left( \frac{\frac{3(40)}{4} - 25}{6} \times 3 \right) = 35.5 + 2.50 = 38.0 \]

\[ Q = \frac{38.0 - 29.5}{2} = \frac{8.5}{2} = 4.25 \]
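
The counting procedure for the quartiles can also be sketched in Python. The helper `grouped_point` is a hypothetical name used only for this illustration; it generalizes the median procedure by counting off any fraction of the scores (one fourth for Q1, three fourths for Q3).

```python
def grouped_point(lower_limits, frequencies, interval_size, fraction):
    """Point below which `fraction` of the scores fall, by counting up."""
    target = fraction * sum(frequencies)     # N/4 for Q1, 3N/4 for Q3
    below = 0                                # S: scores in intervals below L
    for lower, f in zip(lower_limits, frequencies):
        if below + f > target:               # interval containing the point
            return lower + (target - below) * interval_size / f
        below += f
    raise ValueError("fraction must be below 1 and the distribution non-empty")


# Table A.6, lowest interval (18-20) first; real lower limits 17.5, 20.5, ...
frequencies = [1, 2, 3, 4, 7, 8, 6, 5, 2, 1, 1]
lower_limits = [17.5 + 3 * k for k in range(len(frequencies))]

q1 = grouped_point(lower_limits, frequencies, 3, 0.25)
q3 = grouped_point(lower_limits, frequencies, 3, 0.75)
quartile_deviation = (q3 - q1) / 2
print(q1, q3, quartile_deviation)
```

For the Table A.6 scores this should reproduce the values reported in the text (Q1 = 29.5, Q3 = 38.0, Q = 4.25).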


Computing the Mean and Standard Deviation

As noted earlier, the mean is the arithmetic average which is determined by
dividing the sum of the scores by the number of scores, and the standard
deviation is a measure of variability or spread of scores around the mean. The
formulas used in computing these statistics from ungrouped scores clearly define
their meaning but they are inconvenient for computing from a frequency
distribution. For these computations the procedures and formulas are modified
slightly. The results, of course, have exactly the same meaning, whether
computed from ungrouped or grouped scores.


At first glance, the procedure for computing the mean and standard deviation
from a frequency table may appear complicated, but it is merely a matter of
following a series of simple steps. These are listed below and illustrated in
Table A.7.

Table A.7

COMPUTATION OF THE MEAN AND THE STANDARD DEVIATION
FROM A FREQUENCY DISTRIBUTION

Class        Frequency    Deviation*
Interval     (f)          (d)           fd      fd²
48-50         1             5             5      25
45-47         1             4             4      16
42-44         2             3             6      18
39-41         5             2            10      20
36-38         6             1             6       6
33-35         8             0             0       0
30-32         7            −1            −7       7
27-29         4            −2            −8      16
24-26         3            −3            −9      27
21-23         2            −4            −8      32
18-20         1            −5            −5      25

             N = 40                 Σfd = −6    Σfd² = 192

\[ c \text{ (correction)} = \frac{-6}{40} = -.15 \]

\[ M = 34 + (-.15 \times 3) = 34 - .45 = 33.55 \]

\[ SD = 3\sqrt{\frac{192}{40} - (-.15)^2} = 3\sqrt{4.80 - .02} = 3\sqrt{4.78} \]

\[ SD = 3 \times 2.19 = 6.57 \]

* Deviation in interval units is also commonly expressed by x'.

To compute the mean (M) and standard deviation (SD):


1. Select any class interval near the middle of the distribution and call this 
interval zero in the (d) column. The midpoint of this interval is the assumed 
mean, or AM (in Table A.7, the zero interval is 33-35, and the AM = 34). 

2. Determine the deviation (d) of each interval from the zero interval and 
enter in the (d) column. 


3. In each row, multiply the entry in the f column by the entry in the d col- 
umn, and enter the result in the fd column. 


4. In each row, again, multiply the entry in the fd column by the entry in the
d column, and enter the result in the fd² column.


5. Add the entries in the fd column to obtain Σfd (in Table A.7, Σfd = −6).

6. Add the entries in the fd² column to obtain Σfd² (in Table A.7, Σfd² =
192).

7. Substitute the obtained values in the following formulas:

\[ c = \frac{\Sigma fd}{N} \qquad M = AM + (c \times i) \qquad SD = i\sqrt{\frac{\Sigma fd^2}{N} - c^2} \]

in which
AM = the assumed mean
c = the correction
i = size of the class interval

The results presented in Table A.7 indicate that this distribution of scores
can be described by a mean of 33.55 and a standard deviation of 6.57. Thus
we can expect approximately two thirds of the scores to fall between 27 and 40
(M ± SD, rounded to whole numbers). The percentage of cases falling within
each standard deviation area under the normal curve and the use of SD in
computing standard scores is described in detail in Chapter 14.
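
The assumed-mean procedure can be sketched in Python as follows. The function name `coded_mean_sd` is an assumption for this illustration, and the interval midpoints (19, 22, ..., 49) are derived from the class intervals of Table A.7.

```python
import math


def coded_mean_sd(midpoints, frequencies, assumed_mean, interval_size):
    """Mean and SD from a frequency distribution by the assumed-mean method."""
    n = sum(frequencies)
    # Step 2: deviation (d) of each interval from the zero interval, in interval units.
    d = [round((m - assumed_mean) / interval_size) for m in midpoints]
    # Steps 3 to 6: the fd and fd-squared columns and their sums.
    sum_fd = sum(f * di for f, di in zip(frequencies, d))
    sum_fd2 = sum(f * di * di for f, di in zip(frequencies, d))
    # Step 7: correction, mean, and standard deviation.
    c = sum_fd / n
    mean = assumed_mean + c * interval_size
    sd = interval_size * math.sqrt(sum_fd2 / n - c ** 2)
    return mean, sd


# Table A.7, lowest interval (18-20) first.
frequencies = [1, 2, 3, 4, 7, 8, 6, 5, 2, 1, 1]
midpoints = [19 + 3 * k for k in range(len(frequencies))]

mean, sd = coded_mean_sd(midpoints, frequencies, assumed_mean=34, interval_size=3)
print(round(mean, 2), round(sd, 2))
```

Carried at full precision, the standard deviation comes out near 6.56 rather than the 6.57 shown in Table A.7; the small difference arises because the table rounds at each intermediate step.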


COMPUTING THE COEFFICIENT OF CORRELATION 


The meaning of the correlation coefficient and its use in describing the validity
and reliability of test scores can be found in Chapters 4 and 5. Basically, a
coefficient of correlation expresses the degree of relationship between two sets
of scores by numbers ranging from +1.00 to −1.00. A perfect positive correlation
is indicated by a coefficient of +1.00 and a perfect negative correlation by
a coefficient of −1.00. A correlation coefficient of .00 lies midway between
these extremes and indicates no relationship between the two sets of scores.
Obviously, the larger the coefficient (positive or negative), the higher the degree
of relationship expressed.

Two of the most common methods of computing the coefficient of correlation
are the rank-difference method and the product-moment method. The rank-
difference method, which is described in the computing guide in Chapter 4, is
satisfactory if the number of scores to be correlated is small. For most
classroom purposes it provides a simple, practical technique. The product-moment
method is favored where the number of scores is large. Thus, it is the type of
correlation that is most commonly reported in test manuals and research studies.
The product-moment coefficient is indicated by the symbol r.

The computation of the product-moment coefficient of correlation will be
described and illustrated, here, for ungrouped test scores. The computation
with grouped data appears more complicated and requires a much more detailed
description. This can be obtained from any standard textbook in statistics. A
few such references are listed at the end of this unit.


Table A.8

PRODUCT-MOMENT CORRELATION FOR UNGROUPED DATA

   X        Y        X²         Y²        XY
  119       77     14,161     5,929     9,163
  118       76     13,924     5,776     8,968
  116       72     13,456     5,184     8,352
  115       67     13,225     4,489     7,705
  112       82     12,544     6,724     9,184
  109       63     11,881     3,969     6,867
  108       60     11,664     3,600     6,480
  106       78     11,236     6,084     8,268
  105       69     11,025     4,761     7,245
  104       49     10,816     2,401     5,096
  102       48     10,404     2,304     4,896
  100       58     10,000     3,364     5,800
   98       56      9,604     3,136     5,488
   97       57      9,409     3,249     5,529
   95       74      9,025     5,476     7,030
   94       62      8,836     3,844     5,828
   93       46      8,649     2,116     4,278
   91       65      8,281     4,225     5,915
   90       59      8,100     3,481     5,310
   89       54      7,921     2,916     4,806
2,061    1,272    214,161    83,028   132,208
 (ΣX)     (ΣY)     (ΣX²)     (ΣY²)     (ΣXY)
N = 20



The following steps will serve as a guide for computing a product-moment
correlation coefficient (r) from ungrouped data.²

1. Begin by writing the pairs of scores to be studied in two columns. Make
certain that the pair of scores for each pupil is in the same row. Call one column
X and the other Y (see Table A.8).

2. Square each of the entries in the X column and enter the result in the X²
column.

3. Square each of the entries in the Y column and enter the result in the Y²
column.

4. In each row, multiply the entry in the X column by the entry in the Y
column, and enter the result in the XY column.

5. Add the entries in each column to find the sum (Σ) of each column. Note
the number (N) of pairs of scores. From Table A.8, then:

ΣX = 2,061     ΣX² = 214,161
ΣY = 1,272     ΣY² = 83,028
N = 20         ΣXY = 132,208

² Computation is simplified by the use of the table of squares and square roots in Appendix B.



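6. Substitute the obtained values in the following formula:

\[ r = \frac{\dfrac{\Sigma XY}{N} - \left(\dfrac{\Sigma X}{N}\right)\left(\dfrac{\Sigma Y}{N}\right)}{\sqrt{\dfrac{\Sigma X^2}{N} - \left(\dfrac{\Sigma X}{N}\right)^2}\;\sqrt{\dfrac{\Sigma Y^2}{N} - \left(\dfrac{\Sigma Y}{N}\right)^2}} \]

This formula looks complex, but it involves simple arithmetic. Since the sum
of each column is divided by N, this step can be completed before putting the
data into the formula. Thus, for the data from Table A.8,

\[ \frac{\Sigma X}{N} = \frac{2{,}061}{20} = 103.05 \]

\[ \frac{\Sigma Y}{N} = \frac{1{,}272}{20} = 63.60 \]

\[ \frac{\Sigma X^2}{N} = \frac{214{,}161}{20} = 10{,}708.05 \]

\[ \frac{\Sigma Y^2}{N} = \frac{83{,}028}{20} = 4{,}151.40 \]

\[ \frac{\Sigma XY}{N} = \frac{132{,}208}{20} = 6{,}610.40 \]

Then,

\[ r = \frac{6{,}610.40 - (103.05)(63.60)}{\sqrt{10{,}708.05 - (103.05)^2}\;\sqrt{4{,}151.40 - (63.60)^2}} \]

\[ r = \frac{6{,}610.40 - 6{,}553.98}{\sqrt{10{,}708.05 - 10{,}619.30}\;\sqrt{4{,}151.40 - 4{,}044.96}} \]

\[ r = \frac{56.42}{\sqrt{88.75}\;\sqrt{106.44}} = \frac{56.42}{9.42 \times 10.32} = \frac{56.42}{97.21} \]

\[ r = .58 \]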
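
The six steps can be checked with a short Python sketch. The function name `product_moment_r` is an assumption for this illustration; the scores are the X and Y columns of Table A.8.

```python
import math

# X and Y columns of Table A.8 (one pair of scores per pupil).
X = [119, 118, 116, 115, 112, 109, 108, 106, 105, 104,
     102, 100, 98, 97, 95, 94, 93, 91, 90, 89]
Y = [77, 76, 72, 67, 82, 63, 60, 78, 69, 49,
     48, 58, 56, 57, 74, 62, 46, 65, 59, 54]


def product_moment_r(x, y):
    """Product-moment r from raw scores, using the column sums of Steps 2 to 5."""
    n = len(x)
    mean_x = sum(x) / n                              # sum of X over N
    mean_y = sum(y) / n                              # sum of Y over N
    mean_x2 = sum(a * a for a in x) / n              # sum of X squared over N
    mean_y2 = sum(b * b for b in y) / n              # sum of Y squared over N
    mean_xy = sum(a * b for a, b in zip(x, y)) / n   # sum of XY over N
    numerator = mean_xy - mean_x * mean_y
    denominator = math.sqrt(mean_x2 - mean_x ** 2) * math.sqrt(mean_y2 - mean_y ** 2)
    return numerator / denominator


r = product_moment_r(X, Y)
print(round(r, 2))
```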


Although it is not readily apparent from the above formula, the computations
involve finding the mean and the standard deviation of each set of scores (X
and Y). Thus the formula can also be written

\[ r = \frac{\dfrac{\Sigma XY}{N} - (M_x)(M_y)}{(SD_x)(SD_y)} \]

in which
M_x = mean of scores in X column
M_y = mean of scores in Y column
SD_x = standard deviation of scores in X column
SD_y = standard deviation of scores in Y column

Thus, for the same data

\[ r = \frac{6{,}610.40 - (103.05)(63.60)}{9.42 \times 10.32} = .58 \]
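
The same result can be reached from the means and standard deviations directly. The sketch below uses `fmean` and `pstdev` from Python's standard `statistics` module; `pstdev` divides by N, which matches the standard deviation formula used in this appendix. The variable names are chosen only for this illustration.

```python
from statistics import fmean, pstdev

# X and Y columns of Table A.8.
X = [119, 118, 116, 115, 112, 109, 108, 106, 105, 104,
     102, 100, 98, 97, 95, 94, 93, 91, 90, 89]
Y = [77, 76, 72, 67, 82, 63, 60, 78, 69, 49,
     48, 58, 56, 57, 74, 62, 46, 65, 59, 54]

mean_x, mean_y = fmean(X), fmean(Y)      # M_x and M_y
sd_x, sd_y = pstdev(X), pstdev(Y)        # SD_x and SD_y (divide by N)
mean_xy = sum(a * b for a, b in zip(X, Y)) / len(X)

r = (mean_xy - mean_x * mean_y) / (sd_x * sd_y)
print(round(sd_x, 2), round(sd_y, 2), round(r, 2))
```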




If the means and standard deviations are already available for the two sets of 
scores, this latter formula is easier to apply. If they are not available, the first 
formula can be used, and the means and standard deviations of the two sets 
of scores are computed in the process. (This can be seen by comparing the two 
formulas.) 

It should be noted that the scores in Table A.8 are the same scores used to
illustrate rank-difference correlation in Chapter 4. There we obtained a
coefficient of .60, in comparison to the .58 obtained here. Differences this large may
be expected from the two methods, but seldom is the difference larger than this
(unless there are many ties in rank). Also, for all practical purposes the two
types of correlation coefficients can be interpreted in the same manner. For a
description of how to interpret and use the coefficient of correlation see
Chapter 4.
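
The comparison can be checked with a short Python sketch. The rank-difference formula itself is presented in Chapter 4 and is not reproduced in this appendix; the sketch assumes the usual form, 1 − 6Σd²/[N(N² − 1)], where d is the difference between a pupil's two ranks. It also assumes there are no tied scores, which holds for the Table A.8 data. The function name `ranks` is hypothetical.

```python
# X and Y columns of Table A.8.
X = [119, 118, 116, 115, 112, 109, 108, 106, 105, 104,
     102, 100, 98, 97, 95, 94, 93, 91, 90, 89]
Y = [77, 76, 72, 67, 82, 63, 60, 78, 69, 49,
     48, 58, 56, 57, 74, 62, 46, 65, 59, 54]


def ranks(scores):
    """Rank 1 for the highest score; valid only when no scores are tied."""
    position = {score: i + 1
                for i, score in enumerate(sorted(scores, reverse=True))}
    return [position[score] for score in scores]


n = len(X)
# Sum of squared differences between each pupil's two ranks.
sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(X), ranks(Y)))
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(rho, 2))
```

With these scores the rank-difference coefficient rounds to .60, as stated above, against the product-moment value of .58.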

A Final Caution. Correlation indicates the degree of relationship between 
two sets of scores, but not causation. If X and Y are related, there are several 
possible explanations: (1) X may cause Y, (2) Y may cause X, or (3) X and Y 
may be the result of a common cause. For example, the increase in incidence 
of juvenile delinquency during the past decade has been paralleled by a cor- 
responding increase in teachers’ salaries. Thus, the correlation between these 
two sets of figures would probably be quite high. Obviously, further study is 
needed to determine the cause of a particular relationship. 


BASIC REFERENCES

Blommers, P. J., and E. F. Lindquist. Elementary Statistical Methods in Psychology and
Education. Boston: Houghton Mifflin, 1960.

Garrett, H. E. Elementary Statistics. 2nd edition, New York: David McKay, 1962.

Guilford, J. P. Fundamental Statistics in Psychology and Education. 4th edition, New York:
McGraw-Hill, 1965.

Walker, H. M., and J. Lev. Elementary Statistical Methods. Revised edition, New York:
Holt, Rinehart & Winston, 1958.


SELF-INSTRUCTIONAL GUIDES

Bradley, J. I., and J. N. McClelland. Basic Statistical Concepts: A Self-Instructional Text.
Chicago: Scott, Foresman & Co., 1963, 168 pages.

Townsend, E. A., and P. J. Burke. Statistics for the Classroom Teacher: A Self-Teaching
Unit. New York: Macmillan, 1963, 68 pages.


Appendix B

Table of Squares and Square Roots

N  N  N²  N  √N
1 101 10201 1 1.000 
2 102 10404 2 1.414 
3 103 10609 3 1.732 
4 104 10816 4 2.000 
5 105 11025 5 2.236 
6 106 11236 6 2.449
7 107 11449 7 2.646
8 108 11664 8 2.828
9 109 11881 9 3.000
10 110 12100 10 3.162
11 111 12321 11 3.317
12 112 12544 12 3.464
13 113 12769 13 3.606
14 114 12996 14 3.742
15 115 13225 15 3.873
16 116 13456 16 4.000
17 117 13689 17 4.123
18 118 13924 18 4.243
19 119 14161 19 4.359 
20 120 14400 20 4.472 
21 121 14641 21 4.583 
22 122 14884 22 4.690 
23 123 15129 23 4.796 
24 124 15376 24 4.899 
25 125 15625 25 5.000 
26 126 15876 26 5.099 
27 127 16129 27 5.196 
28 128 16384 28 5.292 
29 129 16641 29 5.385 
30 130 16900 30 5.477 
31 131 17161 31 5.568 
32 132 17424 32 5.657 
33 133 17689 33 5.745 
34 134 17956 34 5.831 
35 135 18225 35 5.916 
36 136 18496 36 6.000 
37 137 18769 37 6.083 
38 138 19044 38 6.164 
39 139 19321 39 6.245 
40 140 19600 40 6.325 
41 141 19881 41 6.403 
42 142 20164 42 6.481 
43 143 20449 43 6.557 
44 144 20736 44 6.633 
45 145 21025 45 6.708 
46 146 21316 46 6.782 
47 147 21609 47 6.856 
48 148 21904 48 6.928 
49 149 22201 49 7.000 
50 150 22500 50 7.071 100 




17.378 
17.407 
17.436 
17.464 


17.493 
17.521 
17.550
17.578 
17.607 


17.635 
17.664 
17.692 
17.720 
17.748 


17.776 
17.804 
17.833 
17.861
17.889


17.916 
17.944 
17.972 
18.000 
18.028 


18.055 
18.083 
18.111 
18.138 
18.166 


18.193 
18.221 
18.248
18.276 
18.303 


18.330 
18.358 
18.385 
18.412 
18.439 


18.466 
18.493 
18.520 
18.547
18.574 


18.601 
18.628 
18.655 
18.682 
18.708 




21.237 
21.260 
21.284 
21.307 
21.331 


21.354 
21.378 
21.401 
21.424 
21.448 


21.471 
21.494 
21.517 
21.541 
21.564 


21.587 
21.610 
21.633 
21.656 
21.679 


21.703
21.726 
21.749 
21.772 
21.794 


21.817 
21.840 
21.863 
21.886 
21.909 


21.932


21.954 
21.977 
22.000
22.023 


22.045 
22.068 
22.091 
22.113 
22.136 


22.159 
22.181 
22.204 
22.226
22.249 


22.271 
22.293 
22.316 
22.338 
22.361 


25.515 
25.534 
25.554 
25.573 
25.593 


25.612 
25.632 
25.652 
25.671 
25.690 


25.710 
25.729 
25.749 
25.768 
25.788 


25.807 
25.826 
25.846
25.865 
25.884


25.904 
25.923 
25.942 
25.962 


25.981 


26.000 
26.019 
26.038 
26.058 
26.077 


26.096 
26.115 
26.134 
26.153 
26.173 


26.192 
26.211 
26.230 
26.249 
26.268


26.287
26.306 
26.325 
26.344 
26.363 


26.382 
26.401 
26.420 
26.439 


26.458


N    √N     | N    √N     | N    √N
751 27.404 | 801 28.302 | 851 29.172
752 27.423 | 802 28.320 | 852 29.189
753 27.441 | 803 28.337 | 853 29.206
754 27.459 | 804 28.355 | 854 29.223
755 27.477 | 805 28.373 | 855 29.240
756 27.495 | 806 28.390 | 856 29.257
757 27.514 | 807 28.408 | 857 29.275
758 27.532 | 808 28.425 | 858 29.292
759 27.550 | 809 28.443 | 859 29.309
760 27.568 | 810 28.460 | 860 29.326
761 27.586 | 811 28.478 | 861 29.343
762 27.604 | 812 28.496 | 862 29.360
763 27.622 | 813 28.513 | 863 29.377
764 27.641 | 814 28.531 | 864 29.394
765 27.659 | 815 28.548 | 865 29.411
766 27.677 | 816 28.566 | 866 29.428
767 27.695 | 817 28.583 | 867 29.445
768 27.713 | 818 28.601 | 868 29.462
769 27.731 | 819 28.618 | 869 29.479
770 27.749 | 820 28.636 | 870 29.496
771 27.767 | 821 28.653 | 871 29.513
772 27.785 | 822 28.671 | 872 29.530
773 27.803 | 823 28.688 | 873 29.547
774 27.821 | 824 28.705 | 874 29.563
775 27.839 | 825 28.723 | 875 29.580
776 27.857 | 826 28.740 | 876 29.597
777 27.875 | 827 28.758 | 877 29.614
778 27.893 | 828 28.775 | 878 29.631
779 27.911 | 829 28.792 | 879 29.648
780 27.928 | 830 28.810 | 880 29.665
781 27.946 | 831 28.827 | 881 29.682
782 27.964 | 832 28.844 | 882 29.698
783 27.982 | 833 28.862 | 883 29.715
784 28.000 | 834 28.879 | 884 29.732
785 28.018 | 835 28.896 | 885 29.749
786 28.036 | 836 28.914 | 886 29.766
787 28.054 | 837 28.931 | 887 29.783
788 28.071 | 838 28.948 | 888 29.799
789 28.089 | 839 28.965 | 889 29.816
790 28.107 | 840 28.983 | 890 29.833
791 28.125 | 841 29.000 | 891 29.850
792 28.142 | 842 29.017 | 892 29.866
793 28.160 | 843 29.034 | 893 29.883
794 28.178 | 844 29.052 | 894 29.900
795 28.196 | 845 29.069 | 895 29.916
796 28.213 | 846 29.086 | 896 29.933
797 28.231 | 847 29.103 | 897 29.950
798 28.249 | 848 29.120 | 898 29.967
799 28.267 | 849 29.138 | 899 29.983
800 28.284 | 850 29.155 | 900 30.000


30.017 
30.033 
30.050 
30.067 
30.083 


30.100 
30.116 
30.133
30.150
30.166 


30.183 
30.199 
30.216 
30.232 
30.249 


30.265 
30.282 
30.299 
30.315 
30.332 


30.348 
30.364 
30.381 
30.397 
30.414 


30.430 
30.447 
30.463 
30.480 
30.496 


30.512 
30.529 
30.545 
30.561 

30.578 


30.594 
30.610 
30.627 
30.643 
30.659 


30.676 
30.692 
30.708 
30.725 
30.741 


30.757 
30.773 
30.790 
30.806 
30.822 


N 


951 
952 


JN N 

30.838 | 1001 
30.854 | 1002 
30.871 | 1003 
30.887 | 1004 
30.903 | 1005 
30.919 | 1006 
30.935 | 1007 
30.952 | 1008 
30.968 | 1009 
30.984 | 1010 
31.000 | 1011 
31.016 | 1012 
31.032 | 1013 
31.048 | 1014 
31.064 | 1015 
31.081 | 1016 
31.097 | 1017 
31.113 | 1018 
31.129 | 1019 
31.145 | 1020 
31.161 | 1021 
31.177 | 1022 
31.193 | 1023 
31.209 | 1024 
31.225 | 1025 
31.241 | 1026 
31.257 | 1027 
31.273 | 1028 
31.289 | 1029 
31.305 | 1030 
31.321 | 1031 
31.337 | 1032 
31.353 | 1033 
31.369 | 1034 
31.385 | 1035 
31.401 | 1036 
31.417 | 1037 
31.432 | 1038 
31.448 | 1039 
31.464 | 1040 
31.480 | 1041 
31.496 | 1042 
31.512 | 1043 
31.528 | 1044 
31.544 | 1045 
31.559 | 1046 
31.575 | 1047 
31.591 | 1048 


31.607 | 1049 
31.623 | 1050 


31.639 
31.654 
31.670 
31.686 
31.702 


31.718 
31.733 
31.749 
31.765 
31.780 


31.796 
31.812 
31.828 
31.843 
31.859 


31.875 
31.890 
31.906 
31.922 
31.937 


31.953
31.969 
31.984 
32.000 
32.016 


32.031 
32.047 
32.062 
32.078 
32.094 


32.109 
32.125
32.140 
32.156 
32.171 


32.187
32.202 
32.218 
32.234 
32.249 


32.265 
32.280
32.296 
32.311 
32.326 


32.342 
32.357 
32.373
32.388
32.404 


Appendix C

A List of Test Publishers

Below is a list of the test publishers and distributors whose tests were referred
to earlier in this book (the tests are briefly described in Appendix D). An
asterisk (*) after the name indicates that the company provides free bulletins on
testing and the use of test results. All publishers will provide catalogues of their
current tests.

The names and addresses of other test publishers and distributors can be
obtained from the latest volume of Buros' Mental Measurements Yearbook.

1. American Guidance Service, Inc.
720 Washington Avenue, S.E.
Minneapolis, Minnesota 55414

2. Bureau of Publications
Teachers College, Columbia University
New York, New York 10027

3. California Test Bureau*
Del Monte Research Park
Monterey, California 93940

4. Consulting Psychologists Press, Inc.
577 College Street
Palo Alto, California 94306

5. Cooperative Test Division*
Educational Testing Service
Princeton, New Jersey 08541

6. Harcourt, Brace & World, Inc.*
757 Third Avenue
New York, New York 10017

7. Houghton Mifflin Company*
2 Park Street
Boston, Massachusetts 02107

8. Personnel Press, Inc.
188 Nassau Street
Princeton, New Jersey 08541

9. Psychological Corporation*
304 East 45th Street
New York, New York 10017

10. Science Research Associates, Inc.
259 East Erie Street
Chicago, Illinois 60611



Appendix D

A SELECTED LIST OF STANDARDIZED TESTS

[Table. Columns: Test Name (Publisher's No.)*; Grade Levels Covered†; Testing Time (Minutes)‡; Major Areas Measured**; MMY Review††. Sections: Achievement Batteries; Reading Tests; Scholastic Aptitude Tests; Multi-Aptitude Batteries; Readiness Tests; Problem Checklists; Interest Inventories.]

* The publisher's numbers (in parentheses) refer to the list in Appendix C.
† Gives total span only, not the number of separate levels available (K = kindergarten, A = adult).
‡ Changes in time are mainly due to different forms used at different grade levels.
** Indicates the general areas measured but not the specific scores available.
†† Refers to Mental Measurements Yearbook and entry (e.g., 5-2 = second entry in Fifth Yearbook); New (5) = new edition, old edition reviewed in Fifth Yearbook.


Subject Index


Ability tests, 10
Academic aptitude; see Scholastic aptitude
Academic Promise Tests (APT), 237, 284, 294, 410
Accomplishment quotient, 283
Achievement, versus aptitude, 10, 230-233
Achievement tests; see Classroom tests, Essay test items, Objective test items,
Standardized achievement tests
Activity checklists, 345-347
Adjustment inventories; see Personality inventories, Problem checklists
Administration of tests, 97, 203-205, 254-258
Age equivalent, 280
Age norms, 280-283
Alternate forms; see Equivalent forms
Alternative-response test, 127-133
Anecdotal records
advantages and limitations of, 311-313
deciding what to observe, 310-311
form for, 309
improving effectiveness of, 313-314
nature of, 13, 308-310
uses of, 310
Answer form
for machine scoring, 260
teacher-made, 201
Aptitude, versus achievement, 10, 230-233
Aptitude tests; see Multi-aptitude tests, Scholastic aptitude tests
Attitude scales, 354-357
Attitudes, as learning outcomes, 27


Batteries; see Standardized achievement tests

Behavior changes, 21; see also Learning outcomes, Objectives

Bias in rating, 321-323

Binet Scales, 240-241

Blueprint, test; see Table of specifications

California Achievement Tests, 225, 320, 370, 408

California Short-Form Test of Mental Maturity, 235, 409
California Test of Mental Maturity, 234, 409 
Central tendency, measures of, 387, 388 
Central tendency error, 322 
Checklists 
activity, 345-347 
characteristics of, 13, 325 
for evaluating “concern for others,” 327 
for evaluating skill in use of microscope, 
326 
in marking and reporting, 375-376 
problem, 347-348, 410 
Class interval, 391-392 
Classification exercise; see Interpretive exercise
Classroom tests (see also Essay test items, 
Objective test items) 
administration of, 203-205 
advantages of essay and objective, 108 
difficulty of, 112-113 
directions for, 198-203 
editing of, 195-197 
evaluation of, 207-215 
grouping of terms in, 197-198 
planning for, 48-51 
preparing for use, 195, 203
principles of construction, 109-117 
reproducing, 203 
scoring of, 190-193, 205-207 
types of, 104-107 
versus standardized tests, 223-224
Comparable forms, 98 
Completion test; see Short-answer test 
Complex achievement, measurement of, 160- 
193 
Concurrent validity, 70-71 
Construct validity, 72-74 
Content validity, 62-64
Cooperative School and College Ability Tests 
(SCAT), 231, 236, 409 




Correction for guessing, 206-207 
Correlation coefficient 
computation of, 65-67, 397-400 
interpretation of, 67-68, 400 
Cost of testing, 99 
Creative ability “guess who” form, 335 
Creative activities checklist, 346 
Criterion, and validity, 71, 76 
Criterion-related validity, 64-71 
Culture-fair test, 233-234 


Davis-Eells Games, 233, 234, 409 
Davis Reading Test, 228, 408 
Derived scores, 274-275, 276; see also Norms, 
Scores, Standard scores 
Deviation IQ, 289
Diagnosing learning difficulties, 368-372 
Diagnostic analysis form, 370 
Diagnostic Reading Tests, 229 
Diagnostic Reading Tests: Survey Section, 
228, 408 
Diagnostic tests 
meaning of, 11-12 
in testing program, 268-269 
types of, 228-229 
Diagnostic Tests and Self Helps in Arithme- 
tic, 229 
Differential Aptitude Tests (DAT), 237, 239, 
410 
Difficulty of test, 93-94, 211 
Directions for test; see Test directions 
Discriminating power, 211-212 
Dispersion; see Variability 
Distracters 
determining effectiveness of, 212-213 
meaning of, 140 
plausibility of, 154-155 
Distractions, avoidance of in testing, 205, 255 
Distribution 
frequency, 391-393, 395, 396 
normal, 286-288, 291 
Dual-marking, 375, 377 
Durrell Analysis of Reading Difficulty, 229, 
409 


Editing test items, 195-197 

Education Index, 249 

Educational and Psychological Measurement, 
249 

Educational objectives; see Objectives 

Educational quotient, 283 

Enabling behaviors, 114 

Encyclopedia of Educational Research, 28 

Essential High School Content Battery, 226, 
231, 408 

Equal units, 272-273 

Equivalence, coefficient of, 83-84 

Equivalent-forms, 82, 98 

Equivalent-forms method, 83-84 




Error of measurement; see Standard error of 
measurement 
Essay test items (see also Test items) 
advantages and limitations of, 184-187 
extended response, 107, 182-183 
form and uses of, 180-183 
learning outcomes measured, 184 
restricted response, 107, 181-182 
suggestions for constructing, 187-190 
suggestions for scoring, 190-193 
types of, 106-107 
versus objective, 11, 108 
Evaluation 
and improvement of learning, 361-372 
meaning of, 6-7 
planning for, 44-51 
principles of, 14-17 
procedures of, 9-14 
programs, 266-269 
relating to objectives, 44-57 
role of in marking and reporting, 372-380 
and teaching, 7-9 
Evaluation form for tests, 253 
Evaluation results 
reliability of, 79-96 
usability of, 96-99 
uses of, 8-9, 261-266, 361-372 
validity of, 59-78 
Examination; see Test 
Expectancy table, 68-70 


Factors influencing 
IQ scores, 242-243
norms, 296-298 
reliability, 90-95 
test administration, 256-258 
test scores, 74-76, 300-301 
test selection, 250-253 
validity, 74-76 
Feedback, 365-366 
Forced-choice procedure, 349-351 
Forms 
comparable, 98 
equivalent, 82, 98 
Frequency polygon, 392-393 
Frequency distribution 
construction of, 391-392 
graphic presentation, 392-393 
use in computation, 395, 396 


Gates Advanced Primary Reading Test, 228, 
409 

Gates Primary Reading Tests, 228, 409 

Gates Reading Readiness Test, 241, 410 

Gates Reading Survey, 228, 409 

General educational development, tests of, 
226 

Generosity error, 321 

Goals; see Objectives 

Grade equivalent, 276 




Grade norms, 276-281 
Grades, assignment of, 265; see also Mark- 
ing and reporting 
Group test, 12 
Grouped data, analysis of, 390-397 
Grouping pupils 
procedure illustrated, 4-5 
with sociometric results, 341-342 
and standardized tests, 262-263 
“Guess who” technique, 14, 333-335 
Guessing 
correction for, 206-207 
and directions, 202-203 


Haggerty-Olson-Wickman Behavior Rating 
Schedules, 322 

Halo effect, 322 

Hand scoring, 201, 258 

Handwriting scale, 320 

Harrison-Stroud Reading Readiness Profiles, 
241, 410 

Henmon-Nelson Tests of Mental Ability, 233, 
409 

Histogram, 392-393 

Individual tests, 12, 240-241 

Informal achievement tests; see Classroom 
tests 




Instruction (see also Teaching) 
goals of; see Objectives 
grouping for, 4-5, 262-263 
individualizing, 263-264 
planning for, 262 
Intelligence quotient (IQ) 
cautions in interpreting, 242-243 
deviation IQ, 240, 242, 283, 289 
factors influencing, 242-243 
ratio IQ, 242, 283 
Intelligence tests, 12, 229-243; see also Scholastic 
aptitude tests 
Interest inventories, 351-354, 411 
Internal consistency, coefficient of, 84-86 
Interpreting item analysis data, 213 
Interpretive exercise 
advantages and limitations of, 172-173 
forms and uses of, 161-171 
nature of, 161 
suggestions for constructing, 173-178 
use of pictorial materials, 167-171 
Interquartile range, 389 
Interview, 12-13, 345 
Inventories; see Interest inventories, Personality inventories 
Iowa Tests of Basic Skills, 225, 408 
Iowa Tests of Educational Development, 226, 
231, 408 
Item analysis, 208-215 
Item arrangement, 197-198 
Item difficulty, 112-113, 211 
Item discriminating power, 211-212 
Item file, 215-216 
Item validity, 213 


Journal of Consulting Psychology, 249 
Journals, for test reviewers, 249 


Kelley-Greene Reading Comprehension Test, 
228, 409 
Key-type items; see Interpretive exercise 
Knowledge 
measurement of, 120-138, 142-145 
types of, 27 
Knowledge of progress, 365-366 
Kuder Preference Records, 351-352, 411 
Kuder-Richardson formula, 85 
Kuder-Richardson method, 85-86 
Kuhlmann-Anderson Intelligence Tests, 233, 
236, 409, 410 
Kuhlmann-Finch Intelligence Tests, 233, 410 


Learning and evaluation 
clarifying goals, 362-363 
diagnosing and remedying difficulties, 368- 
372 
increasing retention and transfer, 366-368 
motivating learning, 364-366 
understanding the learner, 363-364 
in teaching, 8 
Learning outcomes (see also Objectives) 
meaning of, 21 
and nontest procedures, 54-55 
and standardized tests, 55-56 
and test items, 50-54 
types of, 26-27 
Lee-Clark Reading Readiness Test, 241, 410 
Length of test, 90-91 
Letter grade; see Marking and reporting 
Local norms, use of, 298-300 
Logical error, 323 
Lorge-Thorndike Intelligence Tests, 231, 234, 
410 


Machine scoring, 259-260 

Marking and reporting 
to administrators, 374 
checklist for, 375-376 
functions of, 372-374 
letter grades, 375, 379-380 
parent-teacher conference, 376-378 
principles of, 378-380 
to pupils and parents, 373 
report form, 377 
to teachers and counselors, 373 
traditional marking system, 374-375 
types of, 374-378 

Mastery tests, 11-12 

Matching test 
advantages and limitations of, 135-136 
characteristics of, 105, 134 


suggestions for constructing, 136-138 
uses of, 134-135 
Mean 
assumed, 396 
computation of, 387, 395-397 
meaning of, 286, 387 
Measurement, meaning of, 6 
Median 
computation of, 387-388, 393-395 
meaning of, 285, 387 
Mental ability tests; see Scholastic aptitude 
tests 
Mental age, 240, 242 
Mental Measurements Yearbooks, 225, 227, 
247-248, 251 
Metropolitan Achievement Tests, 225, 295, 
408 
Metropolitan Readiness Test, 241, 410 
Modal-age forms, 280 
Mode, 388 
Mooney Problem Check Lists, 347-348, 410 
Motivating learning, 364-366 
Motivation of pupils tested, 256 
Multi-aptitude tests, 237-240, 410 
Multiple-choice test 
advantages and limitations of, 147-149 
characteristics of, 106, 140-142 
measuring knowledge outcomes, 142-145 
measuring understanding, 145-147 
suggestions for constructing, 149-157 
uses of, 142-147 
Murphy-Durrell Diagnostic Reading Readiness 
Test, 241 


Nelson-Denny Reading Test, 228, 409 
Nelson Reading Test, 228, 409 
Nomination technique; see Peer appraisal 
Nontesting procedures, 307-308 
Nonverbal tests, 12, 234-237 
Normal curve, 286-288, 291 
Normalized scores, 288 
Norms 

age, 280-283 

age-controlled, 280 

grade, 276-280 

judging adequacy of, 293-298 

local, 298-300 

meaning of, 275 

modal-age, 280 

percentile, 283-286 

standard score, 286-291 

types of, 276 

use of, 298-300 


Objective scoring, 11 

Objective test items (see also Test items) 
interpretive exercise, 160-179 
matching, 134-138 
multiple choice, 140-158 
short answer, 121-126 




true-false, 127-133 
types of, 105-106 
versus essay, 11, 108 
Objectives 
determining the adequacy of, 40-50 
dimensions of, 22-26 
and evaluation procedures, 44-57 
examples of, 30-40, 46 
as learning outcomes, 21-22, 26-27 
method of determining, 29-40 
procedural steps for determining, 34-35 
role in evaluation and teaching, 7-8 
role in learning, 361-372 
role in marking and reporting, 372-380 
sources of suggestions for, 28-29 
taxonomy of, 28 
types of, 26-27 
Objectivity, 93-94 
Observational techniques 
anecdotal records, 308-314 
checklists, 325-328 
learning outcomes measured, 308 
ranking methods, 317 
rating scales, 314-324 
types of, 13-14, 307-308 
Odd-even reliability; see Split-half method 
Otis Quick-Scoring Mental Ability Test, 233, 
410 


Parallel forms; see Equivalent forms 
Paired-comparison method, 317 
Peer appraisal 
“guess who” technique, 333-335 
nature of, 308, 332 
social relations scales, 343-344 
sociometric technique, 335-342 
Percentage correct score, 274 
Percentile norms, 283-286 
Percentile rank, 283, 288, 291 
Performance test, 12 
Personal-social development, evaluation of, 
305-358 
Personality inventories, 348-350; see also 
Problem checklists 
Personnel and Guidance Journal, 249 
Pictorial materials in testing, 167-171 
Power test, 12 
Practicality; see Usability 
Prediction efficiency of correlation coefficients, 
68 
Predictive validity, 64-70 
Pretest, 364 
Problem checklists, 347-348, 410 
Procedure evaluation, 305-358 
Product evaluation, 305-358 
Product-moment correlation, 397-400 
Product scale, 319-320 
Profiles, 291-295 
Project TALENT, 345-346 
Projective techniques, 350-351 




Psychological Abstracts, 249 
Published tests, in print, 248 
Publishers of tests, 406 


Quartile deviation, 389, 393-395 

Quartiles, 389-390, 393-395 

Questionnaires; see Interest inventories, Per- 
sonality inventories 

Quotients, use of, 283 


Range of scores, 388-389 
Rank-difference correlation, 65-66, 397, 400 
Ranking methods of rating, 317 
Ranking test scores, 386-387 
Rating, pupil participation in, 328 
Rating scales 
characteristics of, 14, 314-315 
common errors in, 321-323 


in evaluating personal-social development, 
321 


in evaluating procedures, 318 
in evaluating products, 318-320 
principles of effective use, 323-324 
and ranking methods, 317 
relation to objectives, 55 
types of, 315-317 
Raw scores, 273-274 
Reading Comprehension: Cooperative English 
Test, 228, 299, 409 
Reading readiness tests, 241, 410 
Reading tests, 228, 241, 408-409, 410 
Recording 
events during testing, 257 
test items, 195, 210 
test results, 261 
Records 
anecdotal, 13, 308-314 
cumulative, 261 
Reliability 
comparison of methods, 82, 86-87, 94 
equivalent-forms method, 83-84 
factors influencing, 90-95 
interpretation of, 94-96 
Kuder-Richardson method, 85-86 
meaning of, 60, 80-81 
relation to validity, 60, 80-81 
split-half method, 84-85 
standard error of measurement, 87-90 
test-retest method, 82-83 
types of, 82, 87, 94 
Reliability coefficients, 81-87 
Remedying learning difficulties, 368-372 
Reporting (see also Marking and reporting) 
pupil progress, 9, 372-380 
test results, 261, 382 
Response set, 75-76 
Retention of learning, 366-368 
Retest method, 82-83 
Review of Educational Research, 28, 249 


Reviews, test, 249, 251-252 
Rorschach Inkblot Test, 351 


Sequential Tests of Educational Progress 
(STEP), 225, 226, 292, 408 
Scales; see Attitude scales, Rating scales, 
Social relations scales 
School testing program, 266-269 
Scholastic aptitude tests (see also Standard- 
ized tests) 
cautions in interpreting, 241-243 
culture-fair, 233-234 
group tests, 233-240, 409-410 
individual, 240-241 
multi-score, 237-240 
role in testing program, 267-268 
single score, 233-234 
type of learning measured, 231 
verbal and nonverbal, 234-237 
versus achievement tests, 230-233 
Science activities checklist, 347 
Score bands, 88, 291-295, 301 
Scores 
age equivalent, 280-283 
derived, 274-275 
grade equivalent, 276-280 
interpretation of, 300-301 
percentile, 283-286 
raw, 273-274 
standard, 286-291 
Scoring 
correction for guessing in, 206-207 
ease of, 97-98 
essay tests, 190-193 
hand, 201, 258 
keys, 205 
machine, 259-260 
objective tests, 205-207 
standardized tests, 258-260 
stencil, 205 
Selection-type items, 105-106 
Self-rating, 328 
Self-report techniques 
activity checklists, 345 
attitude scales, 354-357 
interest inventories, 351-354 
nature of, 12-13, 308 
personality inventories, 348-350 
problem checklists, 347-348, 410 
projective techniques, 350-351 
use in testing program, 269 
Semi-interquartile range, 389 
Severity error, 322 
Short-answer test 
advantages and limitations of, 123-124 
characteristics of, 105, 121 


suggestions for constructing, 124-126 
uses of, 121-123 


Skills, 27, 308, 318-320, 325-326 
Social relations scales, 343-344 




Sociogram, 340-341 
Sociometric technique 
characteristics of, 14, 335-336 
form for, 337 
matrix table, 339 
sociogram, 340-341 
tabulating results of, 336-340 
uses of, 341-342 
Spearman-Brown formula, 85 
Specific determiners, 116 
Specifications; see Table of specifications 
Specimen sets, 251 
Speech rating scale, 319 
Speed test, 12 
Split-half method, 84-85 
Spread of scores 
measures of, 388-390 
and reliability, 91-93 
Square root tables, 401 
SRA Achievement Series, 225, 408 
SRA Junior Inventory, 347, 349, 410 
SRA Primary Mental Abilities Test (PMA), 
237, 238, 410 
SRA Reading Record, 209, 409 
SRA Tests of General Ability, 410 
SRA Youth Inventory, 348, 410 
Stability, coefficient of, 82-83 
Standard deviation 
computation of, 389-390, 395-397 
and standard error, 88 
use with standard scores, 286-288 
Standard error bands, 88, 291-295, 301 
Standard error of measurement, 87-90 
Standard scores 
advantages of, 290-291 
comparison of, 290-291 
computation of, 288-289 
deviation IQ, 289 
normal distribution of, 287, 291 
stanines, 290-291 
T-scores, 289, 291 
z-scores, 288-289, 291 
Standardized achievement tests (see also Standardized 
tests) 
batteries, 225-226, 408 
characteristics of, 222-223 
diagnostic, 228-229 
reading, 228 
role in testing program, 267-268 
in specific areas, 226-228 
type of learning measured, 231 
versus classroom tests, 223-224 
versus scholastic aptitude tests, 230-233 
Standardized scholastic aptitude tests; see 
Scholastic aptitude tests 
Standardized tests (see also Scholastic apti- 
tude tests, Standardized achievement 
tests) 
administration of, 254-258 
characteristics of, 221-223 




classification of, 248 
evaluation form for, 253 
list of, 407-411 
relating to objectives, 55-56 
scoring of, 258-260 
selection of, 246-254 
sources of information about, 247-250 
steps in selecting, 250-254 
testing program of, 266-269 
uses and misuses of, 261-266 
versus informal, 11, 223-224 
Stanford Achievement Test, 225, 408 
Stanford-Binet Intelligence Scale, 240-241 
Stanines, 290-291 
Statistical methods, 65-66, 286-288, 385-400 
Strong Vocational Interest Blanks, 352-353, 
411 
Supply-type items, 105 
Survey tests, 11-12 
Syracuse Scales of Social Relations, 343-344 


T-scores, 289, 291 
Table of specifications 
construction of, 49-51 
importance of, 104 
use in content validity, 63 
Taxonomy of Educational Objectives, 28, 36, 
42, 58 
Teacher-made answer sheet, 201 
Teacher-made tests; see Classroom tests 
Teaching 
effectiveness, 266 
and improvement of learning, 361-372 
relation to evaluation, 7-9 
Technical Recommendations, 61, 81, 289 
Test (see also Classroom tests, Standardized 
tests) 
administration, 97, 203-205, 254-258 
difficulty, 93-94 
directions, 198-203, 256-257 
essay, 106-107 
evaluation of, 207-215, 252-253 
interpretation, 98, 300-301 
length, 90-91 
objective, 105-106 
practical characteristics of, 96-99 
preparing for use, 195-203 
profiles, 291-295 
reproducing, 203 
scores, interpreting, 272-301 
scoring, 190-193, 205-207, 258-260 
selection, 246-254 
uses, 261-266 
Test bulletins; see list at end of each chapter 
Test construction 
essay, 180-193 
interpretive exercise, 160-179 
matching, 134-138 
multiple-choice, 140-158 
principles of, 109-117 


short answer, 121-126 
true-false, 127-133 
Test item card, 195, 210 
Test item file, 215-216 
Test items (see also Essay test items, Objec- 
tive test items) 
analysis of, 207-215 
arranging in test, 197-198 
difficulty of, 112-113 
editing, 195-197 
file for, 215 
principles of construction, 109-117 
recording, 195, 210 
relating to objectives, 50-54, 111-112 
reviewing, 195-197 
types of, 104-107, 110 
Test materials 
ordering, 254 
review of, 195-197, 251-253 
study of, 255-256 
Test publishers, 406 
Test publishers' catalogues, 249 
Test reviews, 249, 251-252 
Test-retest method, 82-83 
Testing (see also Classroom tests, Standard- 
ized tests) 
cost of, 99 
planning for, 48-51, 246-254 
procedures of, 11-12 
program, 266-269 
selecting time of year for, 269 
Tests in Print, 227, 247-248, 251 
Tests, selected list of, 407 
Thematic Apperception Test (TAT), 351 
Thinking skills 
measurement of, 128-129, 160-178, 180-190 
types of, 27 
Time of year for testing, 269 
Timing tests, 257 


Transfer of learning, 366-368 

True-false test 
advantages and limitations of, 129-131 
characteristics of, 105, 127 
suggestions for constructing, 131-133 
uses of, 127-129 

True score, 87-88 

Typical behavior 
evaluation of, 305 
meaning of, 10 


Understanding 

measurement of, 128, 145-157 

types of, 27 
Ungrouped scores, analysis of, 386-390 
Usability, 96-99 


Validity 

concurrent, 70-71 

construct, 72-74 

content, 62-64, 194-195 

criterion-related, 64-71 

factors influencing, 74-76 

meaning of, 59-61 

predictive, 64-70 

relation to reliability, 60, 80-81 

types of, 61 
Validity coefficients, interpretation of, 67-68 
Variability, measures of, 388-390 
Verbal tests, and nonverbal tests, 12, 234-237 
Vocational interest inventories, 351-354, 411 


Wechsler Intelligence Scale for Children, 
240-241 

Wechsler Adult Intelligence Scale, 240-241 

What I Like To Do: An Inventory of Children's 
Interests, 352-353, 411 


z-scores, 288-289, 291 
Zero point, true, 272-273 


author index 


Adkins, Dorothy C., 18 
Ahmann, J. S., 42, 107, 118, 178, 181, 193, 329, 
381 
Anastasi, Anne, 77, 100, 245, 271, 303, 358 
Anderson, H. R., 159 
Arny, Clara B., 118 


Barnard, J. D., 33 
Bauernfeind, R. H., 78, 100, 271, 303, 349, 358 
Beck, J. M., 245 
Berg, H. D., 159 
Blair, G. M., 366, 381 
Blommers, P. J., 400 
Bloom, B. S., 15, 23, 27, 28, 36, 42, 58, 118, 
159 
Bonney, M. E., 342, 358 
Bradfield, J. M., 118 
Bradley, J. L., 400 
Burke, P. J., 400 
Buros, O. K., 225, 227, 228, 229, 240, 241, 247, 
248 
Burton, W. H., 381 


Chauncey, H., 18, 118 
Cook, W. W., 365 
Crites, J. O., 358 
Cronbach, L. J., 56, 67, 73, 75, 76, 78, 80, 81, 
85, 86, 93, 95, 100, 148, 202, 213, 230, 
241, 243, 245, 271, 282, 297, 301, 329, 358, 
363 
Cunningham, Ruth, 334 


Damrin, Dora E., 118, 193, 217 
Davis, Allison, 234 
Davis, F. B., 207, 245, 246, 271, 303, 381 
Diederich, P., 100, 218 
Dobbin, J. E., 18, 118, 382 
Doppelt, J. E., 89, 90, 218 
Downie, N. M., 188 
Dressel, P. L., 18, 42, 118, 178, 381 
Durost, W. N., 280, 295, 303, 382 

Dutton, W. H., 118 
Dyer, H. S., 57 


Ebel, R. L., 58, 78, 105, 107, 118, 148, 161, 
172, 175, 193, 217, 271 

Edwards, A. L., 355, 356 

Eells, K., 234 

Ellis, A., 349 

Engelhart, M. D., 245 


Feldt, L. S., 329 

Ferris, F. L., 58 

Findley, W. G., 58, 245, 271, 303, 382 

Flanagan, J. C., 346 

French, Will, 28 

Furst, E. J., 42, 58, 114, 142, 158, 178, 199, 
213, 217 


Gage, N. L., 118 

Gardner, E. F., 232, 249, 344 

Garrett, H. E., 400 

Gates, A. I., 281 

Gerberich, J. R., 22, 27, 58, 119, 139, 159 

Glock, M. D., 42, 107, 118, 178, 181, 193, 329, 
381 

Good, W. E., 245 

Goslin, D. A., 245 

Green, J. A., 139, 193 

Greene, H. A., 119 

Gronlund, N. E., 336, 337, 338, 339, 341, 342 

Guilford, J. P., 15, 400 


Hagen, Elizabeth, 18, 68, 139, 142, 193, 241, 
245, 275, 279, 298, 303, 330, 358, 373, 380 

Harris, C. W., 28, 31, 245 

Hart, Irene, 303 

Heenan, D. K., 159 

Heil, L. M., 164 

Henry, N. B., 58, 164 

Hoyt, C. J., 100 


Jones, R. S., 366, 381 
Jorgensen, A. N., 119 
Justman, J., 330 


Karnes, M. R., 119 

Katz, M., 217, 271 
Kearney, N. C., 28 
Krathwohl, D. R., 42, 58 
Kuder, G. F., 82, 85, 100, 352 


Lado, R., 119 

Lamke, T. A., 233 

Lennon, R. T., 19, 78, 271 

Lev, J., 400 

Likert, R., 356 

Lindquist, E. F., 23, 80, 85, 86, 118, 148, 159, 
161, 172, 175, 185, 186, 400 

Lindvall, C. M., 158 

Lyman, H. B., 303 


McClelland, J. N., 400 
McCune, G. H., 58, 165, 179 
Mager, R. F., 43 
Masia, B. B., 42, 58 
Mayhew, L. B., 178 
Meehl, P. E., 73 
Merwin, J. C., 232 
Meyers, C. E., 353 
Michaelis, J. U., 327 
Micheels, W. J., 119 
Mooney, R. L., 348 
Moredock, H. S., 118 
Morse, H. T., 58, 165, 179 


Nelson, C. H., 166, 179, 381 
Nelson, M. J., 

Noll, V. H., pr 

North, R. D., 271 

Nunnally, J. C., 245 


Odell, C. W., 139 
Ohlsen, M. M., 303 
Olson, W. C., 321 


Page, E. B., 365 

Phillips, B. N., 258 
Popham, E. L., 119 
Prescott, G. A., 303 


Reade, Marybell, 217 
Remmers, H. H., 349, 355, 356 
Richardson, M. W., 82, 85, 100 


Ricks, J. H., 297, 382 
Robbins, I., 330 
Russell, D. H., 161 


Saupe, J. L., 100, 381 
Sax, G., 217 

Schrader, W. B., 303 
Schwartz, A., 43, 330 
Sea, Marcella R., 353 
Seashore, H. G., 245, 297, 303 
Sells, S. B., 358 

Simpson, R. H., 350, 366 
Smith, Ann Z., 382 
Smith, E. R., 23, 161 
Stalnaker, J. M., 185, 186 
Stanley, J. C., 139 
Stodola, Q., 119 

Strong, E. K., 353, 354 
Stroud, J. B., 245 

Stull, H., 159 

Super, D. E., 358 


Thomas, R. M., 43, 48, 330, 382 

Thompson, G., 344 

Thorndike, R. L., 18, 68, 80, 85, 86, 139, 142, 
193, 241, 245, 275, 279, 298, 303, 330, 
358, 373, 380 

Thorpe, L. P., 353 

Thurstone, L. L., 237, 355 

Thurstone, Thelma G., 238 

Tiedeman, S. C., 43, 330 

Tinkelman, S. N., 217 

Torrance, E. P., 245, 335, 346 

Townsend, E. A., 400 

Travers, R. M. W., 35, 142, 148, 149, 158, 178, 
199 

Traxler, A. E., 271, 310 

Trites, D. K., 358 

Tyler, R. W., 18, 23, 24, 161, 326, 367 


Vaughn, K. W., 118 


Walker, H. H., 42 

Walker, H. M., 400 

Wallen, N. E., 234 

Wardeberg, H. L., 42, 118, 329, 381 
Weathers, G. R., 258 

Wesman, A. G., 70, 78, 96, 100, 231, 245 
Willgoose, C. E., 119 

Wood, Dorothy A., 158, 193, 211, 217 
Wrightstone, J. W., 18, 321, 330, 382 
Wrinkle, W. L., 382 




